pg_receivexlog stops upon server restart

Started by Thom Brown almost 14 years ago (8 messages)
#1Thom Brown
thom@linux.com

Hi,

I've tried out pg_receivexlog and have noticed that when restarting
the cluster, pg_receivexlog gets cut off... it doesn't keep waiting.
This is surprising as the DBA would have to remember to start
pg_receivexlog up again.

--
Thom

#2Magnus Hagander
magnus@hagander.net
In reply to: Thom Brown (#1)
Re: pg_receivexlog stops upon server restart

On Friday, April 6, 2012, Thom Brown wrote:

Hi,

I've tried out pg_receivexlog and have noticed that when restarting
the cluster, pg_receivexlog gets cut off... it doesn't keep waiting.
This is surprising as the DBA would have to remember to start
pg_receivexlog up again.

This is intentional in the sense that that's how the code was written;
there isn't a malfunctioning piece of code somewhere.

It would probably make sense to have an auto-reconnect feature, and to have
an option to turn it on/off.

If you haven't already (my wifi here is currently quite useless, which is
why I'm working on my email backlog, so I can't check), please add it to
the open items list.

//Magnus

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#3Thom Brown
thom@linux.com
In reply to: Magnus Hagander (#2)
Re: pg_receivexlog stops upon server restart

On 10 April 2012 21:07, Magnus Hagander <magnus@hagander.net> wrote:

This is intentional in the sense that that's how the code was written;
there isn't a malfunctioning piece of code somewhere.

It would probably make sense to have an auto-reconnect feature, and to have
an option to turn it on/off.

If you haven't already (my wifi here is currently quite useless, which is
why I'm working on my email backlog, so I can't check), please add it to the
open items list.

I think it would also be useful to add a paragraph to the
documentation stating use-cases for this feature, and its advantages.

--
Thom

#4Magnus Hagander
magnus@hagander.net
In reply to: Thom Brown (#3)
1 attachment(s)
Re: pg_receivexlog stops upon server restart

On Thu, Apr 19, 2012 at 1:00 PM, Thom Brown <thom@linux.com> wrote:

I think it would also be useful to add a paragraph to the
documentation stating use-cases for this feature, and its advantages.

Attached is a patch that implements this. Seems reasonable?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Attachments:

pg_receivexlog_loop.patch (application/octet-stream)
diff --git a/doc/src/sgml/ref/pg_receivexlog.sgml b/doc/src/sgml/ref/pg_receivexlog.sgml
index 8e5fca4..ee31065 100644
--- a/doc/src/sgml/ref/pg_receivexlog.sgml
+++ b/doc/src/sgml/ref/pg_receivexlog.sgml
@@ -58,6 +58,14 @@ PostgreSQL documentation
    configured with <xref linkend="guc-max-wal-senders"> set high enough to
    leave at least one session available for the stream.
   </para>
+
+  <para>
+   If the connection is lost, or if it cannot be initially established,
+   with a non fatal error, <application>pg_receivexlog</application> will
+   retry the connection indefinitely, and reestablish streaming as soon
+   as possible. To avoid this behavior, use the <literal>-n</literal>
+   parameter.
+  </para>
  </refsect1>
 
  <refsect1>
@@ -87,6 +95,17 @@ PostgreSQL documentation
 
     <variablelist>
      <varlistentry>
+      <term><option>-n</option></term>
+      <term><option>--noloop</option></term>
+      <listitem>
+       <para>
+        Don't loop on connection errors. Instead, exit right away with
+        an error.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-v</option></term>
       <term><option>--verbose</option></term>
       <listitem>
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 0289c4b..b08afbd 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -280,6 +280,9 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
 
 	/* Get a second connection */
 	param->bgconn = GetConnection();
+	if (!param->bgconn)
+		/* Error message already written in GetConnection() */
+		exit(1);
 
 	/*
 	 * Always in plain format, so we can write to basedir/pg_xlog. But the
@@ -916,6 +919,9 @@ BaseBackup(void)
 	 * Connect in replication mode to the server
 	 */
 	conn = GetConnection();
+	if (!conn)
+		/* Error message already written in GetConnection() */
+		exit(1);
 
 	/*
 	 * Run IDENTIFY_SYSTEM so we can get the timeline
diff --git a/src/bin/pg_basebackup/pg_receivexlog.c b/src/bin/pg_basebackup/pg_receivexlog.c
index 2134c87..c1fae27 100644
--- a/src/bin/pg_basebackup/pg_receivexlog.c
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -33,9 +33,13 @@
 
 #include "getopt_long.h"
 
+/* Time to sleep between reconnection attempts */
+#define RECONNECT_SLEEP_TIME 5
+
 /* Global options */
 char	   *basedir = NULL;
 int			verbose = 0;
+int			noloop = 0;
 int			standby_message_timeout = 10;		/* 10 sec = default */
 volatile bool time_to_abort = false;
 
@@ -55,6 +59,7 @@ usage(void)
 	printf(_("\nOptions controlling the output:\n"));
 	printf(_("  -D, --dir=directory       receive xlog files into this directory\n"));
 	printf(_("\nGeneral options:\n"));
+	printf(_("  -n, --noloop              do not loop on connection lost\n"));
 	printf(_("  -v, --verbose             output verbose messages\n"));
 	printf(_("  -?, --help                show this help, then exit\n"));
 	printf(_("  -V, --version             output version information, then exit\n"));
@@ -223,6 +228,9 @@ StreamLog(void)
 	 * Connect in replication mode to the server
 	 */
 	conn = GetConnection();
+	if (!conn)
+		/* Error message already written in GetConnection() */
+		return;
 
 	/*
 	 * Run IDENTIFY_SYSTEM so we can get the timeline and current xlog
@@ -326,7 +334,7 @@ main(int argc, char **argv)
 		}
 	}
 
-	while ((c = getopt_long(argc, argv, "D:h:p:U:s:wWv",
+	while ((c = getopt_long(argc, argv, "D:h:p:U:s:nwWv",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -364,6 +372,9 @@ main(int argc, char **argv)
 					exit(1);
 				}
 				break;
+			case 'n':
+				noloop = 1;
+				break;
 			case 'v':
 				verbose++;
 				break;
@@ -406,7 +417,28 @@ main(int argc, char **argv)
 	pqsignal(SIGINT, sigint_handler);
 #endif
 
-	StreamLog();
+	while (true)
+	{
+		StreamLog();
+		if (time_to_abort)
+			/*
+			 * We've been Ctrl-C'ed. That's not an error, so exit without
+			 * an errorcode.
+			 */
+			exit(0);
+		else if (noloop)
+		{
+			fprintf(stderr, _("%s: disconnected.\n"), progname);
+			exit(1);
+		}
+		else
+		{
+			fprintf(stderr, _("%s: disconnected. Waiting %d seconds to try again\n"),
+					progname, RECONNECT_SLEEP_TIME);
+			pg_usleep(RECONNECT_SLEEP_TIME * 1000000);
+		}
+	}
 
-	exit(0);
+	/* Never get here */
+	exit(2);
 }
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index cc01537..1416faa 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -65,6 +65,11 @@ xmalloc0(int size)
 }
 
 
+/*
+ * Connect to the server. Returns a valid PGconn pointer if connected,
+ * or NULL on non-permanent error. On permanent error, the function will
+ * call exit(1) directly.
+ */
 PGconn *
 GetConnection(void)
 {
@@ -151,7 +156,7 @@ GetConnection(void)
 		{
 			fprintf(stderr, _("%s: could not connect to server: %s\n"),
 					progname, PQerrorMessage(tmpconn));
-			exit(1);
+			return NULL;
 		}
 
 		/* Connection ok! */
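The change to GetConnection() in the patch above turns a failed connection attempt from a hard exit into a NULL return, and leaves it to each caller to decide what to do. The following is a minimal sketch of that convention in isolation, not the patched code itself: DemoConn, AttemptResult and every demo_* name are hypothetical stand-ins, and the callers return a status instead of calling exit() so the behaviour is easy to check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a connection handle; this is not libpq's PGconn. */
typedef struct DemoConn
{
	int			fd;
} DemoConn;

/* Hypothetical outcome of a single connection attempt. */
typedef enum
{
	ATTEMPT_OK,
	ATTEMPT_TRANSIENT,
	ATTEMPT_PERMANENT
} AttemptResult;

static DemoConn demo_conn = {3};

/*
 * Returns a connection on success and NULL on a non-permanent error,
 * after printing a message. A permanent error ends the process directly,
 * mirroring the comment the patch adds above GetConnection().
 */
DemoConn *
demo_get_connection(AttemptResult outcome)
{
	switch (outcome)
	{
		case ATTEMPT_OK:
			return &demo_conn;
		case ATTEMPT_TRANSIENT:
			fprintf(stderr, "demo: could not connect to server\n");
			return NULL;
		case ATTEMPT_PERMANENT:
			fprintf(stderr, "demo: permanent error\n");
			exit(1);
	}
	return NULL;
}

/*
 * One-shot caller, in the spirit of the pg_basebackup.c hunks: a NULL
 * connection is fatal. Returns the exit status it would use.
 */
int
demo_one_shot(AttemptResult outcome)
{
	DemoConn   *conn = demo_get_connection(outcome);

	if (!conn)
		return 1;				/* error message already written */
	return 0;
}

/*
 * Long-running caller, in the spirit of StreamLog(): a NULL connection
 * just ends this attempt so that an outer loop can try again.
 */
bool
demo_stream_once(AttemptResult outcome)
{
	DemoConn   *conn = demo_get_connection(outcome);

	if (!conn)
		return false;			/* error message already written */
	return true;
}
```

The point of the split is that only the caller knows whether retrying makes sense: a one-shot tool gives up, while a streaming receiver can come back later.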
#5Thom Brown
thom@linux.com
In reply to: Magnus Hagander (#4)
Re: pg_receivexlog stops upon server restart

On 24 May 2012 13:05, Magnus Hagander <magnus@hagander.net> wrote:

Attached is a patch that implements this. Seems reasonable?

s/non fatal/non-fatal/

Yes, this solves the problem for me, except you forgot to translate
noloop in long_options[] . :)
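For readers unfamiliar with getopt_long(), the point is that adding 'n' to the short option string only enables -n; --noloop additionally needs its own entry in the long option table. The sketch below is an assumption-laden illustration, not the fix that was committed: the table in pg_receivexlog.c is not shown in this thread, so the entries here are made up, DemoOptions and demo_parse_options are hypothetical names, and the system <getopt.h> is used in place of PostgreSQL's "getopt_long.h".

```c
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the two options used in this sketch. */
typedef struct DemoOptions
{
	bool		noloop;
	int			verbose;
} DemoOptions;

/*
 * Parse -n/--noloop and -v/--verbose. Returns 0 on success and -1 on an
 * unrecognised option.
 */
int
demo_parse_options(int argc, char **argv, DemoOptions *opts)
{
	static const struct option long_options[] = {
		{"noloop", no_argument, NULL, 'n'},	/* maps --noloop to 'n' */
		{"verbose", no_argument, NULL, 'v'},
		{NULL, 0, NULL, 0}
	};
	int			c;
	int			option_index = 0;

	opts->noloop = false;
	opts->verbose = 0;

	/* Restart scanning so the function can be called more than once. */
	optind = 1;

	while ((c = getopt_long(argc, argv, "nv",
							long_options, &option_index)) != -1)
	{
		switch (c)
		{
			case 'n':
				opts->noloop = true;
				break;
			case 'v':
				opts->verbose++;
				break;
			default:
				return -1;
		}
	}
	return 0;
}
```

Without the "noloop" row, -n would still work but --noloop would be rejected as an unrecognised option.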

--
Thom

#6Magnus Hagander
magnus@hagander.net
In reply to: Thom Brown (#5)
Re: pg_receivexlog stops upon server restart

On Thu, May 24, 2012 at 2:34 PM, Thom Brown <thom@linux.com> wrote:

s/non fatal/non-fatal/

Yes, this solves the problem for me, except you forgot to translate
noloop in long_options[] . :)

Fixed :-)

Did you test it, or just assume it worked? ;)

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

#7Thom Brown
thom@linux.com
In reply to: Magnus Hagander (#6)
Re: pg_receivexlog stops upon server restart

On 24 May 2012 13:37, Magnus Hagander <magnus@hagander.net> wrote:

Fixed :-)

Did you test it, or just assume it worked? ;)

How very dare you. Of course I tested it. It successfully reconnects
on multiple restarts, checks intermittently when I've stopped the
server, showing the connection error message, successfully continues
when I eventually bring the server back up, and doesn't attempt a
reconnect when using -n.
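The behaviour described above (keep retrying across restarts, stop cleanly on an interrupt, give up at once with -n) follows from the loop the patch adds to main(). Here is a minimal, self-contained sketch of that control flow, assuming a pluggable stream function and sleep function so that it can run without a server. DemoState, demo_receive_loop and the demo_fake_* helpers are hypothetical names, and the loop returns an exit status rather than calling exit().

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical callback types; the real code calls StreamLog() and pg_usleep(). */
typedef void (*demo_stream_fn) (void *state);
typedef void (*demo_sleep_fn) (void *state, int seconds);

/* Hypothetical bookkeeping used by the fake callbacks below. */
typedef struct DemoState
{
	int			attempts;		/* number of streaming attempts so far */
	int			abort_after;	/* raise the abort flag on this attempt; 0 = never */
	volatile bool abort_flag;	/* plays the role of time_to_abort */
	int			sleeps;			/* number of waits between attempts */
} DemoState;

/*
 * Stream until interrupted. Returns 0 after an interrupt, or 1 when the
 * stream ends and noloop is set. Otherwise waits and tries again.
 */
int
demo_receive_loop(demo_stream_fn stream, demo_sleep_fn sleep_fn, void *state,
				  const volatile bool *time_to_abort, bool noloop,
				  int sleep_seconds)
{
	for (;;)
	{
		/* Returns when the connection is lost or cannot be established. */
		stream(state);

		if (*time_to_abort)
			return 0;			/* interrupted by the user: not an error */

		if (noloop)
		{
			fprintf(stderr, "demo: disconnected.\n");
			return 1;
		}

		fprintf(stderr, "demo: disconnected. Waiting %d seconds to try again\n",
				sleep_seconds);
		sleep_fn(state, sleep_seconds);
	}
}

/* Fake stream: counts attempts and simulates Ctrl-C on a chosen attempt. */
void
demo_fake_stream(void *p)
{
	DemoState  *s = p;

	s->attempts++;
	if (s->abort_after != 0 && s->attempts >= s->abort_after)
		s->abort_flag = true;
}

/* Fake sleep: records the wait instead of actually sleeping. */
void
demo_fake_sleep(void *p, int seconds)
{
	DemoState  *s = p;

	(void) seconds;
	s->sleeps++;
}
```

In a real program the sleep callback would actually wait; the patch does this with pg_usleep(RECONNECT_SLEEP_TIME * 1000000), a fixed five-second pause.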

So looks good to me.

--
Thom

#8Magnus Hagander
magnus@hagander.net
In reply to: Thom Brown (#7)
Re: pg_receivexlog stops upon server restart

On Thursday, May 24, 2012, Thom Brown wrote:

How very dare you. Of course I tested it. It successfully reconnects
on multiple restarts, checks intermittently when I've stopped the
server, showing the connection error message, successfully continues
when I eventually bring the server back up, and doesn't attempt a
reconnect when using -n.

So looks good to me.

Thanks - applied!

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/