Logical replication launcher uses wal_retrieve_retry_interval

Started by Masahiko Sawadaover 8 years ago9 messages

sawada.mshk@gmail.com

over 8 years ago

Hi,

I noticed that the logical replication launcher uses
wal_retrieve_retry_interval as a interval of launching logical
replication worker process. This behavior is not documented and I
guess this is no longer consistent with what its name means.

I think that we should either introduce a new GUC parameter (say
logical_replication_retry_interval?) for this or update the
description of wal_retrieve_retry_interval. IMO the former is better.

Thought?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Petr Jelinek

petr.jelinek@2ndquadrant.com

over 8 years ago

In reply to: Masahiko Sawada (#1)

Re: Logical replication launcher uses wal_retrieve_retry_interval

On 14/04/17 12:57, Masahiko Sawada wrote:

Hi,

I noticed that the logical replication launcher uses
wal_retrieve_retry_interval as a interval of launching logical
replication worker process. This behavior is not documented and I
guess this is no longer consistent with what its name means.

Yes that was done based on reviews (and based on general attitude of not
adding more knobs that are similar in meaning). It is briefly documented
in the replication config section. Same is true for wal_receiver_timeout
btw.

I think that we should either introduce a new GUC parameter (say
logical_replication_retry_interval?) for this or update the
description of wal_retrieve_retry_interval. IMO the former is better.

I am not quite sure adding more GUCs is all that great option. When
writing the patches I was wondering if we should perhaps rename the
wal_receiver_timeout and wal_retrieve_retry_interval to something that
makes more sense for both physical and logical replication though.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Michael Paquier

michael.paquier@gmail.com

over 8 years ago

In reply to: Petr Jelinek (#2)

Re: Logical replication launcher uses wal_retrieve_retry_interval

On Fri, Apr 14, 2017 at 9:19 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

I am not quite sure adding more GUCs is all that great option. When
writing the patches I was wondering if we should perhaps rename the
wal_receiver_timeout and wal_retrieve_retry_interval to something that
makes more sense for both physical and logical replication though.

It seems to me that you should really have a different GUC,
wal_retrieve_retry_interval has been designed to work in the startup
process, and I think that it should still only behave as originally
designed. And at some point I think that it would make as well sense
to be able to make this parameter settable at worker-level.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Petr Jelinek (#2)

Re: Logical replication launcher uses wal_retrieve_retry_interval

On Fri, Apr 14, 2017 at 9:19 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 14/04/17 12:57, Masahiko Sawada wrote:

Hi,

I noticed that the logical replication launcher uses
wal_retrieve_retry_interval as a interval of launching logical
replication worker process. This behavior is not documented and I
guess this is no longer consistent with what its name means.

Yes that was done based on reviews (and based on general attitude of not
adding more knobs that are similar in meaning). It is briefly documented
in the replication config section. Same is true for wal_receiver_timeout
btw.

Thank you for the comment!

These two parameters are classed as a standby server parameter in the
document. We might want to fix it.

I think that we should either introduce a new GUC parameter (say
logical_replication_retry_interval?) for this or update the
description of wal_retrieve_retry_interval. IMO the former is better.

I am not quite sure adding more GUCs is all that great option. When
writing the patches I was wondering if we should perhaps rename the
wal_receiver_timeout and wal_retrieve_retry_interval to something that
makes more sense for both physical and logical replication though.

It would work but IMO having multiple different behaviors for one GUC
parameter is not good design.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Petr Jelinek

petr.jelinek@2ndquadrant.com

over 8 years ago

In reply to: Michael Paquier (#3)

Re: Logical replication launcher uses wal_retrieve_retry_interval

On 14/04/17 14:30, Michael Paquier wrote:

On Fri, Apr 14, 2017 at 9:19 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

I am not quite sure adding more GUCs is all that great option. When
writing the patches I was wondering if we should perhaps rename the
wal_receiver_timeout and wal_retrieve_retry_interval to something that
makes more sense for both physical and logical replication though.

It seems to me that you should really have a different GUC,
wal_retrieve_retry_interval has been designed to work in the startup
process, and I think that it should still only behave as originally
designed.

Ah yeah I am actually confusing it with wal_receiver_timeout which
behaves same for wal_receiver and subscription worker. So yeah it makes
sense to have separate GUC (I wonder if we then need yet another one for
tablesync though since both of those will be controlling restarts of
subscription workers after crash).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Petr Jelinek (#5)

2 attachment(s)

Re: Logical replication launcher uses wal_retrieve_retry_interval

On Fri, Apr 14, 2017 at 9:59 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 14/04/17 14:30, Michael Paquier wrote:

On Fri, Apr 14, 2017 at 9:19 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

I am not quite sure adding more GUCs is all that great option. When
writing the patches I was wondering if we should perhaps rename the
wal_receiver_timeout and wal_retrieve_retry_interval to something that
makes more sense for both physical and logical replication though.

It seems to me that you should really have a different GUC,
wal_retrieve_retry_interval has been designed to work in the startup
process, and I think that it should still only behave as originally
designed.

Ah yeah I am actually confusing it with wal_receiver_timeout which
behaves same for wal_receiver and subscription worker. So yeah it makes
sense to have separate GUC

Attached two patches add new GUCs apply_worker_timeout and
apply_worker_launch_interval which are used instead of
wal_receiver_timeout and wal_retrieve_retry_timeout. These new
parameters are not settable at worker-level so far.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

0001-Add-a-GUC-parameter-apply_worker_timeout.patchapplication/octet-stream; name=0001-Add-a-GUC-parameter-apply_worker_timeout.patchDownload

From 4226b471e1f88f0e5c55715c46671baf13d8c42d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2017 10:44:27 +0900
Subject: [PATCH 1/2] Add a GUC parameter apply_worker_timeout.

Terminate replication connections that are inactive longer than
the specified number of milliseconds.
---
 doc/src/sgml/config.sgml                      | 31 +++++++++++++++++++++++----
 src/backend/replication/logical/worker.c      | 11 ++++++----
 src/backend/utils/misc/guc.c                  | 12 +++++++++++
 src/backend/utils/misc/postgresql.conf.sample |  2 ++
 src/include/replication/logicalworker.h       |  2 ++
 5 files changed, 50 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 744c5e8..f584431 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3421,9 +3421,8 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
      </para>
 
      <para>
-      Note that <varname>wal_receiver_timeout</varname> and
-      <varname>wal_retrieve_retry_interval</varname> configuration parameters
-      affect the logical replication workers as well.
+      Note that <varname>wal_retrieve_retry_interval</varname> configuration parameters
+      affects the logical replication workers as well.
      </para>
 
      <variablelist>
@@ -3474,7 +3473,31 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
-     </variablelist>
+     <varlistentry id="guc-apply-worker-timeout" xreflabel="apply_worker_timeout">
+      <term><varname>apply_worker_timeout</varname> (<type>integer</type>)        
+      <indexterm>                                                                 
+       <primary><varname>apply_worker_timeout</> configuration parameter</primary>
+      </indexterm>                                                                
+      </term>                                                                     
+      <listitem>                                                                  
+       <para>                                                                     
+        Terminate replication connections that are inactive longer                
+        than the specified number of milliseconds. This is useful for             
+        the receiving subscriber to detect a publisher node crash or network      
+        outage.                                                                   
+       </para>
+       <para>
+        A value of zero disables the timeout mechanism.  This parameter           
+        can only be set in                                                        
+        the <filename>postgresql.conf</> file or on the server command line.
+       </para>
+       <para>
+        The default value is 60 seconds.                                          
+       </para>                                                                    
+      </listitem>                                                                 
+     </varlistentry>                                                              
+
+    </variablelist>
     </sect2>
 
    </sect1>
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 656d399..8d65837 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -86,6 +86,9 @@
 
 #define NAPTIME_PER_CYCLE 1000	/* max sleep time between cycles (1s) */
 
+/* GUC variable */
+int	apply_worker_timeout;
+
 typedef struct FlushPosition
 {
 	dlist_node node;
@@ -1150,7 +1153,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 			/*
 			 * We didn't receive anything new. If we haven't heard
 			 * anything from the server for more than
-			 * wal_receiver_timeout / 2, ping the server. Also, if
+			 * apply_worker_timeout / 2, ping the server. Also, if
 			 * it's been longer than wal_receiver_status_interval
 			 * since the last update we sent, send a status update to
 			 * the master anyway, to report any progress in applying
@@ -1162,14 +1165,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 			 * Check if time since last receive from standby has
 			 * reached the configured limit.
 			 */
-			if (wal_receiver_timeout > 0)
+			if (apply_worker_timeout > 0)
 			{
 				TimestampTz now = GetCurrentTimestamp();
 				TimestampTz timeout;
 
 				timeout =
 					TimestampTzPlusMilliseconds(last_recv_timestamp,
-												wal_receiver_timeout);
+												apply_worker_timeout);
 
 				if (now >= timeout)
 					ereport(ERROR,
@@ -1182,7 +1185,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 				if (!ping_sent)
 				{
 					timeout = TimestampTzPlusMilliseconds(last_recv_timestamp,
-														  (wal_receiver_timeout / 2));
+														  (apply_worker_timeout / 2));
 					if (now >= timeout)
 					{
 						requestReply = true;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 9ad8361..378bad2 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -63,6 +63,7 @@
 #include "postmaster/syslogger.h"
 #include "postmaster/walwriter.h"
 #include "replication/logicallauncher.h"
+#include "replication/logicalworker.h"
 #include "replication/slot.h"
 #include "replication/syncrep.h"
 #include "replication/walreceiver.h"
@@ -1828,6 +1829,17 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"apply_worker_timeout", PGC_SIGHUP, REPLICATION_SUBSCRIBERS,
+			gettext_noop("Sets the maximum wait time to receive data from the publisher."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&apply_worker_timeout,
+		60 * 1000, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"max_connections", PGC_POSTMASTER, CONN_AUTH_SETTINGS,
 			gettext_noop("Sets the maximum number of concurrent connections."),
 			NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 1435d92..20b5bb7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -278,6 +278,8 @@
 
 #max_logical_replication_workers = 4	# taken from max_worker_processes
 #max_sync_workers_per_subscription = 2	# taken from max_logical_replication_workers
+#apply_worker_timeout = 60s		# time that apply worker waits for
+					# communication from publisher#
 
 
 #------------------------------------------------------------------------------
diff --git a/src/include/replication/logicalworker.h b/src/include/replication/logicalworker.h
index 3e0affa..6395850 100644
--- a/src/include/replication/logicalworker.h
+++ b/src/include/replication/logicalworker.h
@@ -12,6 +12,8 @@
 #ifndef LOGICALWORKER_H
 #define LOGICALWORKER_H
 
+extern int apply_worker_timeout;
+
 extern void ApplyWorkerMain(Datum main_arg);
 
 #endif   /* LOGICALWORKER_H */
-- 
2.8.1

0002-Add-a-new-GUC-parameter-apply_worker_launch_interval.patchapplication/octet-stream; name=0002-Add-a-new-GUC-parameter-apply_worker_launch_interval.patchDownload

From a16877bf24fd80a421cf96d8807dea6953accb94 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2017 10:57:09 +0900
Subject: [PATCH 2/2] Add a new GUC parameter apply_worker_launch_interval.

Specify the inteval of launching logical replication apply worker.
---
 doc/src/sgml/config.sgml                      | 22 +++++++++++++++++-----
 src/backend/replication/logical/launcher.c    | 16 +++++++++-------
 src/backend/utils/misc/guc.c                  | 13 +++++++++++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/replication/logicallauncher.h     |  1 +
 5 files changed, 41 insertions(+), 12 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f584431..92e8b0f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3420,11 +3420,6 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       Their values on the publisher are irrelevant.
      </para>
 
-     <para>
-      Note that <varname>wal_retrieve_retry_interval</varname> configuration parameters
-      affects the logical replication workers as well.
-     </para>
-
      <variablelist>
 
      <varlistentry id="guc-max-logical-replication-workers" xreflabel="max_logical_replication_workers">
@@ -3497,6 +3492,23 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>                                                                 
      </varlistentry>                                                              
 
+     <varlistentry id="guc-apply-worker-launch-interval" xreflabel="apply_worker_launch_interval">
+      <term><varname>apply_worker_launch_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>apply_worker_launch_interval</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specify the number of milliseconds after which a logical replication apply
+        worker launched.
+       </para>
+       <para>
+        The default value is 5 seconds. Units are milliseconds if not specified.
+       </para>
+      </listitem>
+     </varlistentry>
+
     </variablelist>
     </sect2>
 
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 2d663f6..6db8793 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -58,6 +58,7 @@
 
 int	max_logical_replication_workers = 4;
 int max_sync_workers_per_subscription = 2;
+int	apply_worker_launch_interval = 5000;
 
 LogicalRepWorker *MyLogicalRepWorker = NULL;
 
@@ -684,9 +685,9 @@ ApplyLauncherMain(Datum main_arg)
 
 		now = GetCurrentTimestamp();
 
-		/* Limit the start retry to once a wal_retrieve_retry_interval */
+		/* Limit the start retry to once a apply_worker_launch_interval */
 		if (TimestampDifferenceExceeds(last_start_time, now,
-									   wal_retrieve_retry_interval))
+									   apply_worker_launch_interval))
 		{
 			/* Use temporary context for the database list and worker info. */
 			subctx = AllocSetContextCreate(TopMemoryContext,
@@ -714,7 +715,7 @@ ApplyLauncherMain(Datum main_arg)
 					logicalrep_worker_launch(sub->dbid, sub->oid, sub->name,
 											 sub->owner, InvalidOid);
 					last_start_time = now;
-					wait_time = wal_retrieve_retry_interval;
+					wait_time = apply_worker_launch_interval;
 					/* Limit to one worker per mainloop cycle. */
 					break;
 				}
@@ -729,11 +730,12 @@ ApplyLauncherMain(Datum main_arg)
 		{
 			/*
 			 * The wait in previous cycle was interrupted in less than
-			 * wal_retrieve_retry_interval since last worker was started,
-			 * this usually means crash of the worker, so we should retry
-			 * in wal_retrieve_retry_interval again.
+			 * apply_worker_launch_interval since last
+			 * worker was started, this usually means crash of the worker,
+			 * so we should retry in apply_worker_launch_interval
+			 * again.
 			 */
-			wait_time = wal_retrieve_retry_interval;
+			wait_time = apply_worker_launch_interval;
 		}
 
 		/* Wait for more work. */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 378bad2..31e76e9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2547,6 +2547,19 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"apply_worker_launch_interval",
+			PGC_SIGHUP,
+			REPLICATION_SUBSCRIBERS,
+			gettext_noop("Sets the time to wait before launching aply woker."),
+			NULL,
+			GUC_UNIT_MS,
+		},
+		&apply_worker_launch_interval,
+		5000, 1, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"log_rotation_age", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Automatic log file rotation will occur after N minutes."),
 			NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 20b5bb7..7bebd63 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -280,6 +280,7 @@
 #max_sync_workers_per_subscription = 2	# taken from max_logical_replication_workers
 #apply_worker_timeout = 60s		# time that apply worker waits for
 					# communication from publisher#
+#apply_worker_launch_interval = 5s	# time to wait before start apply worker
 
 
 #------------------------------------------------------------------------------
diff --git a/src/include/replication/logicallauncher.h b/src/include/replication/logicallauncher.h
index 060946a..696a9e5 100644
--- a/src/include/replication/logicallauncher.h
+++ b/src/include/replication/logicallauncher.h
@@ -14,6 +14,7 @@
 
 extern int max_logical_replication_workers;
 extern int max_sync_workers_per_subscription;
+extern int apply_worker_launch_interval;
 
 extern void ApplyLauncherRegister(void);
 extern void ApplyLauncherMain(Datum main_arg);
-- 
2.8.1

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

over 8 years ago

In reply to: Masahiko Sawada (#6)

Re: Logical replication launcher uses wal_retrieve_retry_interval

On 4/16/17 22:40, Masahiko Sawada wrote:

Attached two patches add new GUCs apply_worker_timeout and
apply_worker_launch_interval which are used instead of
wal_receiver_timeout and wal_retrieve_retry_timeout. These new
parameters are not settable at worker-level so far.

Under what circumstances are these needed? Does anyone ever set these?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Petr Jelinek

petr.jelinek@2ndquadrant.com

over 8 years ago

In reply to: Peter Eisentraut (#7)

Re: Logical replication launcher uses wal_retrieve_retry_interval

On 18/04/17 16:24, Peter Eisentraut wrote:

On 4/16/17 22:40, Masahiko Sawada wrote:

Attached two patches add new GUCs apply_worker_timeout and
apply_worker_launch_interval which are used instead of
wal_receiver_timeout and wal_retrieve_retry_timeout. These new
parameters are not settable at worker-level so far.

Under what circumstances are these needed? Does anyone ever set these?

Personally I don't see need for apply_worker_timeout, no idea why that
can't use wal_receiver_timeout, the mechanics are exactly same, and it's
IMHO only needed because default tcp keepalive settings are usually too
generous. As for apply_worker_launch_interval, I think we want different
name so that it can be used for tablesync rate limiting as well.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

over 8 years ago

In reply to: Petr Jelinek (#8)

Re: Logical replication launcher uses wal_retrieve_retry_interval

On 4/18/17 12:00, Petr Jelinek wrote:

As for apply_worker_launch_interval, I think we want different
name so that it can be used for tablesync rate limiting as well.

But that's a mechanism we don't have yet, so maybe we should design that
when we get there?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers