archive_command failures report confusing exit status
Failures of archive_command calls report a confusing exit status such as:
LOG: archive command "cp 'pg_xlog/000000010000000000000000' '/nonexistent/000000010000000000000000'" failed: return code 256
The actual return code is 1; it neglects to apply WEXITSTATUS().
I figured it would make sense if pgarch.c used the same mechanism that
postmaster.c uses to report the various variants of regular and signal
exits.
I have attached a patch in that direction. It obviously needs a bit of
string struggling to get all the cases right, but the idea should be
clear. Comments?
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Attachments:
log-archive-command-exitstatus.patchtext/x-diff; charset=us-ascii; name=log-archive-command-exitstatus.patchDownload
diff -ur ../cvs-pgsql/src/backend/postmaster/pgarch.c ./src/backend/postmaster/pgarch.c
--- ../cvs-pgsql/src/backend/postmaster/pgarch.c 2007-11-26 13:29:36.000000000 +0100
+++ ./src/backend/postmaster/pgarch.c 2007-12-10 17:59:18.000000000 +0100
@@ -480,15 +480,11 @@
* SIGQUIT while waiting; so a signal is very likely something that
* should have interrupted us too. If we overreact it's no big deal,
* the postmaster will just start the archiver again.
- *
- * Per the Single Unix Spec, shells report exit status > 128 when a
- * called command died on a signal.
*/
- bool signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 128;
- ereport(signaled ? FATAL : LOG,
- (errmsg("archive command \"%s\" failed: return code %d",
- xlogarchcmd, rc)));
+ // FIXME: "archive command \"%s\"", xlogarchcmd
+ // FIXME: get rid of PID = 0
+ LogChildExit(WIFSIGNALED(rc) ? FATAL : LOG, "archive command", 0, rc);
return false;
}
diff -ur ../cvs-pgsql/src/backend/postmaster/postmaster.c ./src/backend/postmaster/postmaster.c
--- ../cvs-pgsql/src/backend/postmaster/postmaster.c 2007-11-19 13:38:49.000000000 +0100
+++ ./src/backend/postmaster/postmaster.c 2007-12-10 18:00:04.000000000 +0100
@@ -305,8 +305,6 @@
static void dummy_handler(SIGNAL_ARGS);
static void CleanupBackend(int pid, int exitstatus);
static void HandleChildCrash(int pid, int exitstatus, const char *procname);
-static void LogChildExit(int lev, const char *procname,
- int pid, int exitstatus);
static void PostmasterStateMachine(void);
static void BackendInitialize(Port *port);
static int BackendRun(Port *port);
@@ -2477,7 +2475,7 @@
/*
* Log the death of a child process.
*/
-static void
+void
LogChildExit(int lev, const char *procname, int pid, int exitstatus)
{
if (WIFEXITED(exitstatus))
diff -ur ../cvs-pgsql/src/include/postmaster/postmaster.h ./src/include/postmaster/postmaster.h
--- ../cvs-pgsql/src/include/postmaster/postmaster.h 2007-02-19 11:05:59.000000000 +0100
+++ ./src/include/postmaster/postmaster.h 2007-12-10 18:00:02.000000000 +0100
@@ -37,6 +37,8 @@
extern int PostmasterMain(int argc, char *argv[]);
extern void ClosePostmasterPorts(bool am_syslogger);
+extern void LogChildExit(int lev, const char *procname,
+ int pid, int exitstatus);
#ifdef EXEC_BACKEND
extern pid_t postmaster_forkexec(int argc, char *argv[]);
Peter Eisentraut wrote:
I figured it would make sense if pgarch.c used the same mechanism that
postmaster.c uses to report the various variants of regular and signal
exits.
Hmm. Getting rid of the "(PID 0)" is going to be a mess enough for
translations that I think it is worth pgarch.c having its own routine
for this. Furthermore I think the detailed archive command should be
reported in an errdetail() field, which makes it even farther off.
(Hmm, there is nearly duplicate code in pclose_check already).
--
Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J
"Nunca confiar� en un traidor. Ni siquiera si el traidor lo he creado yo"
(Bar�n Vladimir Harkonnen)
On Mon, 2007-12-10 at 18:27 +0100, Peter Eisentraut wrote:
Failures of archive_command calls report a confusing exit status such as:
LOG: archive command "cp 'pg_xlog/000000010000000000000000' '/nonexistent/000000010000000000000000'" failed: return code 256
The actual return code is 1; it neglects to apply WEXITSTATUS().
Yes, you're right.
I figured it would make sense if pgarch.c used the same mechanism that
postmaster.c uses to report the various variants of regular and signal
exits.
OK, you could sell me on that. How will you get rid of (PID = 0)?
Maybe it would be better to do that on two lines:
LOG: archive command "............." failed
FATAL: archive_command was terminated by signal ...
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Peter Eisentraut wrote:
I figured it would make sense if pgarch.c used the same mechanism that
postmaster.c uses to report the various variants of regular and signal
exits.
Hmm. Getting rid of the "(PID 0)" is going to be a mess enough for
translations that I think it is worth pgarch.c having its own routine
for this. Furthermore I think the detailed archive command should be
reported in an errdetail() field, which makes it even farther off.
I agree with Alvaro --- trying to make LogChildExit serve two masters
will be uglier and less maintainable than having two copies of the
not-really-so-complex logic involved. Leave postmaster.c alone and
just make pgarch.c smarter.
(But having said that, +1 for improving the message.)
regards, tom lane
Am Montag, 10. Dezember 2007 schrieb Alvaro Herrera:
Peter Eisentraut wrote:
I figured it would make sense if pgarch.c used the same mechanism that
postmaster.c uses to report the various variants of regular and signal
exits.Hmm. Getting rid of the "(PID 0)" is going to be a mess enough for
translations that I think it is worth pgarch.c having its own routine
for this. Furthermore I think the detailed archive command should be
reported in an errdetail() field, which makes it even farther off.
Better patch.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Attachments:
log-archive-command-exitstatus.patchtext/x-diff; charset=iso-8859-1; name=log-archive-command-exitstatus.patchDownload
diff -cr ../cvs-pgsql/src/backend/postmaster/pgarch.c ./src/backend/postmaster/pgarch.c
*** ../cvs-pgsql/src/backend/postmaster/pgarch.c 2007-11-26 13:29:36.000000000 +0100
--- ./src/backend/postmaster/pgarch.c 2007-12-11 17:10:50.000000000 +0100
***************
*** 474,494 ****
rc = system(xlogarchcmd);
if (rc != 0)
{
! /*
! * If either the shell itself, or a called command, died on a signal,
! * abort the archiver. We do this because system() ignores SIGINT and
! * SIGQUIT while waiting; so a signal is very likely something that
! * should have interrupted us too. If we overreact it's no big deal,
! * the postmaster will just start the archiver again.
! *
! * Per the Single Unix Spec, shells report exit status > 128 when a
! * called command died on a signal.
! */
! bool signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 128;
!
! ereport(signaled ? FATAL : LOG,
! (errmsg("archive command \"%s\" failed: return code %d",
! xlogarchcmd, rc)));
return false;
}
--- 474,514 ----
rc = system(xlogarchcmd);
if (rc != 0)
{
! if (WIFEXITED(rc))
! {
! ereport(LOG,
! (errmsg("archive command failed with exit code %d", WEXITSTATUS(rc)),
! errdetail("The archive command was \"%s\".", xlogarchcmd)));
! }
! else if (WIFSIGNALED(rc))
! {
! /*
! * If either the shell itself, or a called command, died
! * on a signal, abort the archiver. We do this because
! * system() ignores SIGINT and SIGQUIT while waiting; so a
! * signal is very likely something that should have
! * interrupted us too. If we overreact it's no big deal,
! * the postmaster will just start the archiver again.
! */
! ereport(FATAL, (
! #if defined(WIN32)
! errmsg("archive command was terminated by exception 0x%X", WTERMSIG(rc)),
! errhint("See C include file \"ntstatus.h\" for a description of the hexadecimal value."),
! #elif defined(HAVE_DECL_SYS_SIGLIST) && HAVE_DECL_SYS_SIGLIST
! errmsg("archive command was terminated by signal %d: %s",
! WTERMSIG(rc),
! WTERMSIG(rc) < NSIG ? sys_siglist[WTERMSIG(rc)] : "(unknown)"),
! #else
! errmsg("archive command was terminated by signal %d", WTERMSIG(exitstatus)),
! #endif
! errdetail("The archive command was \"%s\".", xlogarchcmd)));
! }
! else
! {
! ereport(LOG,
! (errmsg("archive command exited with unrecognized status %d", rc),
! errdetail("The archive command was \"%s\".", xlogarchcmd)));
! }
return false;
}
On Tue, 2007-12-11 at 17:27 +0100, Peter Eisentraut wrote:
Am Montag, 10. Dezember 2007 schrieb Alvaro Herrera:
Peter Eisentraut wrote:
I figured it would make sense if pgarch.c used the same mechanism that
postmaster.c uses to report the various variants of regular and signal
exits.Hmm. Getting rid of the "(PID 0)" is going to be a mess enough for
translations that I think it is worth pgarch.c having its own routine
for this. Furthermore I think the detailed archive command should be
reported in an errdetail() field, which makes it even farther off.Better patch.
Looks much easier to understand.
I prefer archive_command = ... for the errdetail though. The commands
can be quite long so the extra words don't really add anything, plus
they require translation.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Peter Eisentraut <peter_e@gmx.net> writes:
Better patch.
Doesn't this patch break the behavior that is documented in the comment?
Specifically, the case where the restore_command dies on a signal and
this is reported to us by the controlling shell as exitcode > 128.
We want the archiver to die, but this patch makes it not do so.
regards, tom lane
Am Dienstag, 11. Dezember 2007 schrieb Tom Lane:
Peter Eisentraut <peter_e@gmx.net> writes:
Better patch.
Doesn't this patch break the behavior that is documented in the comment?
Specifically, the case where the restore_command dies on a signal and
this is reported to us by the controlling shell as exitcode > 128.
We want the archiver to die, but this patch makes it not do so.
AFAICT, the coding
WIFSIGNALED(rc) || WEXITSTATUS(rc) > 128
is simply redundant, because a signal happened exactly when WIFSIGNALED(rc) is
true.
I have tested this:
LOG: database system was shut down at 2007-12-11 17:15:43 CET
LOG: autovacuum launcher started
LOG: database system is ready to accept connections
FATAL: archive command was terminated by signal 1: Hangup
DETAIL: The archive command was "kill -1 $$".
LOG: archiver process (PID 22572) exited with exit code 1
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Peter Eisentraut <peter_e@gmx.net> writes:
Am Dienstag, 11. Dezember 2007 schrieb Tom Lane:
Doesn't this patch break the behavior that is documented in the comment?
Specifically, the case where the restore_command dies on a signal and
this is reported to us by the controlling shell as exitcode > 128.
We want the archiver to die, but this patch makes it not do so.
AFAICT, the coding
WIFSIGNALED(rc) || WEXITSTATUS(rc) > 128
is simply redundant, because a signal happened exactly when WIFSIGNALED(rc) is
true.
No, you are confusing the cases "called shell was killed by a signal"
and "called command was killed by a signal, which the shell then turned
around and reported to us as exit > 128".
I have tested this:
A single test case proves little. You need to consider race conditions
and cases where the shell is ignoring the particular signal. I'm fairly
certain that both of these exit statuses can happen in practice, at
least with some shells/platforms.
regards, tom lane
Simon Riggs wrote:
I prefer archive_command = ... for the errdetail though. The commands
can be quite long so the extra words don't really add anything, plus
they require translation.
I did like this suggestion, but then I noticed, we don't actually report the
setting of the archive_command setting but the actual command that was
executed, with the placeholders filled out. So I think the way I posted it
is more correct.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Tom Lane wrote:
No, you are confusing the cases "called shell was killed by a signal"
and "called command was killed by a signal, which the shell then turned
around and reported to us as exit > 128".
Yes, I had missed that difference. Next try ...
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Attachments:
log-archive-command-exitstatus.patchtext/x-diff; charset=iso-8859-1; name=log-archive-command-exitstatus.patchDownload
diff -cr ../cvs-pgsql/src/backend/postmaster/pgarch.c ./src/backend/postmaster/pgarch.c
*** ../cvs-pgsql/src/backend/postmaster/pgarch.c 2007-11-25 12:39:56.000000000 +0100
--- ./src/backend/postmaster/pgarch.c 2007-12-11 23:28:43.000000000 +0100
***************
*** 484,494 ****
* Per the Single Unix Spec, shells report exit status > 128 when a
* called command died on a signal.
*/
! bool signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 128;
! ereport(signaled ? FATAL : LOG,
! (errmsg("archive command \"%s\" failed: return code %d",
! xlogarchcmd, rc)));
return false;
}
--- 484,518 ----
* Per the Single Unix Spec, shells report exit status > 128 when a
* called command died on a signal.
*/
! int lev = (WIFSIGNALED(rc) || WEXITSTATUS(rc) > 128) ? FATAL : LOG;
! if (WIFEXITED(rc))
! {
! ereport(lev,
! (errmsg("archive command failed with exit code %d", WEXITSTATUS(rc)),
! errdetail("The archive command was \"%s\".", xlogarchcmd)));
! }
! else if (WIFSIGNALED(rc))
! {
! ereport(lev, (
! #if defined(WIN32)
! errmsg("archive command was terminated by exception 0x%X", WTERMSIG(rc)),
! errhint("See C include file \"ntstatus.h\" for a description of the hexadecimal value."),
! #elif defined(HAVE_DECL_SYS_SIGLIST) && HAVE_DECL_SYS_SIGLIST
! errmsg("archive command was terminated by signal %d: %s",
! WTERMSIG(rc),
! WTERMSIG(rc) < NSIG ? sys_siglist[WTERMSIG(rc)] : "(unknown)"),
! #else
! errmsg("archive command was terminated by signal %d", WTERMSIG(exitstatus)),
! #endif
! errdetail("The archive command was \"%s\".", xlogarchcmd)));
! }
! else
! {
! ereport(lev,
! (errmsg("archive command exited with unrecognized status %d", rc),
! errdetail("The archive command was \"%s\".", xlogarchcmd)));
! }
return false;
}
On Tue, 2007-12-11 at 23:31 +0100, Peter Eisentraut wrote:
Simon Riggs wrote:
I prefer archive_command = ... for the errdetail though. The commands
can be quite long so the extra words don't really add anything, plus
they require translation.I did like this suggestion, but then I noticed, we don't actually report the
setting of the archive_command setting but the actual command that was
executed, with the placeholders filled out. So I think the way I posted it
is more correct.
I think you should lose the "The", if nothing else. Most Postgres
messages I recall say "return code = xx" not "The return code ..."
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote:
On Tue, 2007-12-11 at 23:31 +0100, Peter Eisentraut wrote:
Simon Riggs wrote:
I prefer archive_command = ... for the errdetail though. The commands
can be quite long so the extra words don't really add anything, plus
they require translation.I did like this suggestion, but then I noticed, we don't actually report the
setting of the archive_command setting but the actual command that was
executed, with the placeholders filled out. So I think the way I posted it
is more correct.I think you should lose the "The", if nothing else. Most Postgres
messages I recall say "return code = xx" not "The return code ..."
Right, that's because they are in errmessage() and not errdetail(). The
guidelines are different. (The reason they are different really escapes
me, but you have to note that (1) Peter invented them, and (2) this
grammatical difference always make sense when you read the logs.)
--
Alvaro Herrera http://www.PlanetPostgreSQL.org/
"There is evil in the world. There are dark, awful things. Occasionally, we get
a glimpse of them. But there are dark corners; horrors almost impossible to
imagine... even in our worst nightmares." (Van Helsing, Dracula A.D. 1972)
Peter Eisentraut <peter_e@gmx.net> writes:
Yes, I had missed that difference. Next try ...
Looks sane to me.
regards, tom lane
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Simon Riggs wrote:
I think you should lose the "The", if nothing else. Most Postgres
messages I recall say "return code = xx" not "The return code ..."
Right, that's because they are in errmessage() and not errdetail(). The
guidelines are different.
Right --- errdetail is supposed to be a complete sentence. (The
rationale for that is explained in TFM.) I do find the current phrasing
a bit awkward, though; what's bugging me is putting quotes around the
command string. It seems highly likely that the command string will
itself contain quotes and thus that the added quotes will be confusing.
I wonder if it'd be OK to do
The failed archive command was: %s
which is stretching the style guidelines by omitting a trailing period,
but I think that might be justifiable in this context.
regards, tom lane