[PATCH] Report exit code from external recovery commands properly

Started by Peter Eisentrautabout 12 years ago6 messages
#1Peter Eisentraut
peter_e@gmx.net
1 attachment(s)

When an external recovery command such as restore_command or
archive_cleanup_command fails, it just reports "return code 34567" or
something, but we have facilities to do decode this properly, so use
them.

Attachments:

0001-Report-exit-code-from-external-recovery-commands-pro.patchtext/x-patch; charset=UTF-8; name=0001-Report-exit-code-from-external-recovery-commands-pro.patchDownload
>From 8aa3cf503fe1c1f41a2a833c008f4273c22a86c9 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter_e@gmx.net>
Date: Wed, 13 Nov 2013 06:38:18 -0500
Subject: [PATCH] Report exit code from external recovery commands properly

When an external recovery command such as restore_command or
archive_cleanup_command fails, report the exit code properly,
distinguishing signals and normal exists, using the existing
wait_result_to_str() facility, instead of just reporting the return
value from system().
---
 src/backend/access/transam/xlogarchive.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 342975c..be95684 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -300,8 +300,8 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 	signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;
 
 	ereport(signaled ? FATAL : DEBUG2,
-		(errmsg("could not restore file \"%s\" from archive: return code %d",
-				xlogfname, rc)));
+		(errmsg("could not restore file \"%s\" from archive: %s",
+				xlogfname, wait_result_to_str(rc))));
 
 not_available:
 
@@ -410,9 +410,10 @@ ExecuteRecoveryCommand(char *command, char *commandName, bool failOnSignal)
 		ereport((signaled && failOnSignal) ? FATAL : WARNING,
 		/*------
 		   translator: First %s represents a recovery.conf parameter name like
-		  "recovery_end_command", and the 2nd is the value of that parameter. */
-				(errmsg("%s \"%s\": return code %d", commandName,
-						command, rc)));
+		  "recovery_end_command", the 2nd is the value of that parameter, the
+		  third an already translated error message. */
+				(errmsg("%s \"%s\": %s", commandName,
+						command, wait_result_to_str(rc))));
 	}
 }
 
-- 
1.8.4.2

#2Peter Geoghegan
pg@heroku.com
In reply to: Peter Eisentraut (#1)
Re: [PATCH] Report exit code from external recovery commands properly

On Wed, Nov 13, 2013 at 3:42 AM, Peter Eisentraut <peter_e@gmx.net> wrote:

When an external recovery command such as restore_command or
archive_cleanup_command fails, it just reports "return code 34567" or
something, but we have facilities to do decode this properly, so use
them.

I think this is a very good idea, but you should go a bit further:
document the special relationship restore_command has to special
return codes. Currently, the documentation says:

"It is important that the archive command return zero exit status if
and only if it succeeded. Upon getting a zero result, PostgreSQL will
assume that the WAL segment file has been successfully archived, and
will remove or recycle it. However, a nonzero status tells PostgreSQL
that the file was not archived; it will try again periodically until
it succeeds."

Yes, this concerns archive_command (where return code values that are
non-zero *are* never distinguished), but nothing much is said about
the return code of restore_command specifically anywhere else, so it's
implied that it's exactly inverse to archive_command. In reality, some
special return codes have a significance to restore_command: they make
recovery abort, because they're taking as proxies for various failures
that it isn't sensible to continue recovery in the event of.

We're talking about the difference between recovery aborting, and
recovery having conceptually "reached the end of the WAL stream", so
it's very surprising that this isn't documented currently.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Geoghegan (#2)
1 attachment(s)
Re: [PATCH] Report exit code from external recovery commands properly

On Wed, 2013-11-13 at 19:14 -0800, Peter Geoghegan wrote:

I think this is a very good idea, but you should go a bit further:
document the special relationship restore_command has to special
return codes.

How about this?

Attachments:

recovery-command-exit-doc.patchtext/x-patch; charset=UTF-8; name=recovery-command-exit-doc.patchDownload
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 1712974..995933c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1084,9 +1084,17 @@ <title>Recovering Using a Continuous Archive Backup</title>
 
    <para>
     It is important that the command return nonzero exit status on failure.
-    The command <emphasis>will</> be called requesting files that are not present
-    in the archive; it must return nonzero when so asked.  This is not an
-    error condition.  Not all of the requested files will be WAL segment
+    The command <emphasis>will</> be called requesting files that are not
+    present in the archive; it must return nonzero when so asked.  This is not
+    an error condition.  An exception is that if the command was terminated by
+    a signal (other than <systemitem>SIGTERM</systemitem>, which is used as
+    part of a database server shutdown) or an error by the shell (such as
+    command not found), then recovery will abort and the server will not start
+    up.
+   </para>
+
+   <para>
+    Not all of the requested files will be WAL segment
     files; you should also expect requests for files with a suffix of
     <literal>.backup</> or <literal>.history</>. Also be aware that
     the base name of the <literal>%p</> path will be different from
diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index c0c543e..9d80256 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -70,6 +70,10 @@ <title>Archive Recovery Settings</title>
 restore_command = 'cp /mnt/server/archivedir/%f "%p"'
 restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
 </programlisting>
+        An exception is that if the command was terminated by a signal (other
+        than <systemitem>SIGTERM</systemitem>, which is used as part of a
+        database server shutdown) or an error by the shell (such as command
+        not found), then recovery will abort and the server will not start up.
        </para>
       </listitem>
      </varlistentry>
@@ -106,8 +110,10 @@ <title>Archive Recovery Settings</title>
         command.
        </para>
        <para>
-        If the command returns a non-zero exit status then a WARNING log
-        message will be written.
+        If the command returns a nonzero exit status then a warning log
+        message will be written.  An exception is that if the command was
+        terminated by a signal or an error by the shell (such as command not
+        found), a fatal error will be raised.
        </para>
       </listitem>
      </varlistentry>
@@ -127,10 +133,11 @@ <title>Archive Recovery Settings</title>
         last valid restart point, like in <xref linkend="archive-cleanup-command">.
        </para>
        <para>
-        If the command returns a non-zero exit status then a WARNING log
+        If the command returns a nonzero exit status then a warning log
         message will be written and the database will proceed to start up
         anyway.  An exception is that if the command was terminated by a
-        signal, the database will not proceed with startup.
+        signal or an error by the shell (such as command not found), the
+        database will not proceed with startup.
        </para>
       </listitem>
      </varlistentry>
#4Peter Geoghegan
pg@heroku.com
In reply to: Peter Eisentraut (#3)
Re: [PATCH] Report exit code from external recovery commands properly

On Sun, Nov 24, 2013 at 5:11 AM, Peter Eisentraut <peter_e@gmx.net> wrote:

How about this?

Hmm. You say:

+        If the command returns a nonzero exit status then a warning log
+        message will be written.  An exception is that if the command was
+        terminated by a signal or an error by the shell (such as command not
+        found), a fatal error will be raised.

But in the case of the archiver, in contrast to the startup process,
this isn't really a big deal. It'll just pick up where it left off.
Whereas the reaper code shuts down the system if this happens in the
startup process. In my opinion that's a distinction that bears
emphasizing.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Geoghegan (#4)
Re: [PATCH] Report exit code from external recovery commands properly

On Sat, 2013-11-30 at 15:56 -0800, Peter Geoghegan wrote:

On Sun, Nov 24, 2013 at 5:11 AM, Peter Eisentraut <peter_e@gmx.net> wrote:

How about this?

Hmm. You say:

+        If the command returns a nonzero exit status then a warning log
+        message will be written.  An exception is that if the command was
+        terminated by a signal or an error by the shell (such as command not
+        found), a fatal error will be raised.

But in the case of the archiver, in contrast to the startup process,
this isn't really a big deal. It'll just pick up where it left off.
Whereas the reaper code shuts down the system if this happens in the
startup process. In my opinion that's a distinction that bears
emphasizing.

That snippet you quote is about archive_cleanup_command. My patch
doesn't touch archive_command at all.

The current documentation about archive_command contains

<para>
It is important that the archive command return zero exit status if and
only if it succeeds. Upon getting a zero result,
<productname>PostgreSQL</> will assume that the file has been
successfully archived, and will remove or recycle it. However, a nonzero
status tells <productname>PostgreSQL</> that the file was not archived;
it will try again periodically until it succeeds.
</para>

which I think addresses your point.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Peter Geoghegan
pg@heroku.com
In reply to: Peter Eisentraut (#5)
Re: [PATCH] Report exit code from external recovery commands properly

On Mon, Dec 2, 2013 at 6:23 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

which I think addresses your point.

Fair enough. Marked "ready for committer".

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers