restore_command return code behaviour

Started by Jean-Christophe Arnu6 months ago19 messages
#1Jean-Christophe Arnu
jcarnu@gmail.com

Dear hackers,

We encountered an issue with restore_command when using scp on a v16
version. When SCP cannot connect to a host, it returns a return code of 255
(I won't discuss the decision to use such a return code). The return code of
the restore_command is tested at [1]https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/access/transam/xlogarchive.c#L268 by calling wait_result_is_any_signal()
[2]: https://github.com/postgres/postgres/blob/master/src/common/wait_error.c#L126
generated in [1]https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/access/transam/xlogarchive.c#L268 in ereport, which leads to the standby shutting down [3]https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/utils/error/elog.c#L560-L591. This
is perfectly fine for us.

However, the documentation [4]https://www.postgresql.org/docs/18/runtime-config-wal.html#GUC-RESTORE-COMMAND -- Jean-Christophe Arnu states the following:

It is important for the command to return a zero exit status only if it

succeeds. The command *will* be asked for file names that are not present
in the archive; it must return nonzero when so asked. (...)

An exception is that if the command was terminated by a signal (other

than SIGTERM, which is used as part of a database server shutdown) or an
error by the shell (such as command not found), then recovery will abort
and the server will not start up.

Could we perhaps improve the documentation by stating that return codes
over 125 or (at least) 128 will lead to the server not starting?

This may help people better understand the behaviour of the restore_command
and quickly solve these kinds of issues without having to examine the
source code.

If you agree, we can suggest a patch for the documentation.

Any thoughts would be much appreciated!

Thank you

[1]: https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/access/transam/xlogarchive.c#L268
https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/access/transam/xlogarchive.c#L268
[2]: https://github.com/postgres/postgres/blob/master/src/common/wait_error.c#L126
https://github.com/postgres/postgres/blob/master/src/common/wait_error.c#L126
[3]: https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/utils/error/elog.c#L560-L591
https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/utils/error/elog.c#L560-L591
[4]: https://www.postgresql.org/docs/18/runtime-config-wal.html#GUC-RESTORE-COMMAND -- Jean-Christophe Arnu
https://www.postgresql.org/docs/18/runtime-config-wal.html#GUC-RESTORE-COMMAND
--
Jean-Christophe Arnu

#2Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jean-Christophe Arnu (#1)
Re: restore_command return code behaviour

On Thu, Jul 24, 2025 at 2:18 PM Jean-Christophe Arnu <jcarnu@gmail.com> wrote:

Could we perhaps improve the documentation by stating that return codes over 125 or (at least) 128 will lead to the server not starting?

This may help people better understand the behaviour of the restore_command and quickly solve these kinds of issues without having to examine the source code.

If you agree, we can suggest a patch for the documentation.

If you've tripped over it, I think we should consider a docs patch.

To confirm -- you're happy with the behavior as-is?

--Jacob

#3Jean-Christophe Arnu
jcarnu@gmail.com
In reply to: Jacob Champion (#2)
Re: restore_command return code behaviour

On fri. 25 jul. 2025, 00:01, Jacob Champion <
jacob.champion@enterprisedb.com> wrote :

If you agree, we can suggest a patch for the documentation.

If you've tripped over it, I think we should consider a docs patch.

We've spent time on it, guessing it was a return code-driven behaviour,
analysing the code, and finally using debugger to confirm it. If we can
avoid that round trip (and any erroneous analyses stating "it does not
work") for other users, that would be ideal.

To confirm -- you're happy with the behavior as-is?

I'm completely fine with it.
restore_command returning over 128 can be wrapped in a shell script to
catch the given rc *if server shutdown is not the desired behaviour. I
agree that the docs should be as clear as possible about this behaviour
(which may also be an API, that should not be changed).

Here are the places where I think the details should be added :
- GUC documentation [1]https://www.postgresql.org/docs/18/runtime-config-wal.html#GUC-RESTORE-COMMAND : file doc/src/sgml/config.sgml
- Backup and Restore [2]https://www.postgresql.org/docs/16/continuous-archiving.html#BACKUP-PITR-RECOVERY file: doc/src/sgml/backup.sgml

The other mention of restore_command does not involve (or require) return
code details.
If there are no objections, I'll start writing a patch proposal on Monday.

[1]: https://www.postgresql.org/docs/18/runtime-config-wal.html#GUC-RESTORE-COMMAND : file doc/src/sgml/config.sgml
https://www.postgresql.org/docs/18/runtime-config-wal.html#GUC-RESTORE-COMMAND
: file doc/src/sgml/config.sgml
[2]: https://www.postgresql.org/docs/16/continuous-archiving.html#BACKUP-PITR-RECOVERY file: doc/src/sgml/backup.sgml
https://www.postgresql.org/docs/16/continuous-archiving.html#BACKUP-PITR-RECOVERY
file: doc/src/sgml/backup.sgml

#4Jean-Christophe Arnu
jcarnu@gmail.com
In reply to: Jean-Christophe Arnu (#3)
1 attachment(s)
Re: restore_command return code behaviour

On friday, 25 jul. 2025 à 09:35, Jean-Christophe Arnu <jcarnu@gmail.com>
wrote :

On fri. 25 jul. 2025, 00:01, Jacob Champion <
jacob.champion@enterprisedb.com> wrote :

Here are the places where I think the details should be added :
- GUC documentation [1]
- Backup and Restore [2]

The other mention of restore_command does not involve (or require) return
code details.
If there are no objections, I'll start writing a patch proposal on Monday.

[1]
https://www.postgresql.org/docs/18/runtime-config-wal.html#GUC-RESTORE-COMMAND
: file doc/src/sgml/config.sgml
[2]
https://www.postgresql.org/docs/16/continuous-archiving.html#BACKUP-PITR-RECOVERY
file: doc/src/sgml/backup.sgml

Here's a first version of this tiny doc patch.

Hope this is clear enough.
--
Jean-Christophe Arnu

Attachments:

v1-0001-add-details-on-restore-command-return-code-128.patchtext/x-patch; charset=US-ASCII; name=v1-0001-add-details-on-restore-command-return-code-128.patchDownload
From 961bb382763e8523e79ac23ad081bc44dfab63b9 Mon Sep 17 00:00:00 2001
From: Jean-Christophe Arnu <jc.arnu@loxodata.com>
Date: Mon, 28 Jul 2025 09:49:17 +0200
Subject: [PATCH v1] add details on restore-command return code >= 128

Documentation does not mention restore_command returns codes including
128 and above. If such a code is returned, the server is shut down.
---
 doc/src/sgml/backup.sgml | 3 ++-
 doc/src/sgml/config.sgml | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 5f7489afbd1..caaade3d0e3 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1310,7 +1310,8 @@ restore_command = 'cp /mnt/server/archivedir/%f %p'
     a signal (other than <systemitem>SIGTERM</systemitem>, which is used as
     part of a database server shutdown) or an error by the shell (such as
     command not found), then recovery will abort and the server will not start
-    up.
+    up. However, the server will also not start if the command returns a code
+    superior or equal to 127.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 20ccb2d6b54..548bbb33ab0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4058,6 +4058,8 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
         than <systemitem>SIGTERM</systemitem>, which is used as part of a
         database server shutdown) or an error by the shell (such as command
         not found), then recovery will abort and the server will not start up.
+        However, the server will also not start if the command returns a code
+        of 128 and above.
        </para>
 
        <para>
-- 
2.39.5

#5Jean-Christophe Arnu
jcarnu@gmail.com
In reply to: Jean-Christophe Arnu (#4)
1 attachment(s)
Re: restore_command return code behaviour

Le lun. 28 juil. 2025 à 10:01, Jean-Christophe Arnu <jcarnu@gmail.com> a
écrit :

Here's a first version of this tiny doc patch.

Hope this is clear enough.

Sorry, the previous version had a bad wording for one file. Please do not
review it. The following version is the correct one.

--
Jean-Christophe Arnu

Attachments:

v2-0001-add-details-on-restore-command-return-code-128.patchtext/x-patch; charset=US-ASCII; name=v2-0001-add-details-on-restore-command-return-code-128.patchDownload
From 185012980d4f6b96a6081570f52975ca667d233e Mon Sep 17 00:00:00 2001
From: Jean-Christophe Arnu <jc.arnu@loxodata.com>
Date: Mon, 28 Jul 2025 10:20:12 +0200
Subject: [PATCH v2] add details on restore-command return code >= 128

Documentation does not mention restore_command returns codes including
128 and above. If such a code is returned, the server is shut down.
---
 doc/src/sgml/backup.sgml | 3 ++-
 doc/src/sgml/config.sgml | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 5f7489afbd1..da4bdd24a7c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1310,7 +1310,8 @@ restore_command = 'cp /mnt/server/archivedir/%f %p'
     a signal (other than <systemitem>SIGTERM</systemitem>, which is used as
     part of a database server shutdown) or an error by the shell (such as
     command not found), then recovery will abort and the server will not start
-    up.
+    up. However, the server will also not start if the command returns a code
+    of 128 and above.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 20ccb2d6b54..548bbb33ab0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4058,6 +4058,8 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
         than <systemitem>SIGTERM</systemitem>, which is used as part of a
         database server shutdown) or an error by the shell (such as command
         not found), then recovery will abort and the server will not start up.
+        However, the server will also not start if the command returns a code
+        of 128 and above.
        </para>
 
        <para>
-- 
2.39.5

In reply to: Jean-Christophe Arnu (#5)
Re: restore_command return code behaviour

Hi Jean-Christophe,

On Mon, 28 Jul 2025 10:23:07 +0200
Jean-Christophe Arnu <jcarnu@gmail.com> wrote:

Le lun. 28 juil. 2025 à 10:01, Jean-Christophe Arnu <jcarnu@gmail.com> a
écrit :

Here's a first version of this tiny doc patch.

Hope this is clear enough.

Sorry, the previous version had a bad wording for one file. Please do not
review it. The following version is the correct one.

       a signal (other than <systemitem>SIGTERM</systemitem>, which is used as
       part of a database server shutdown) or an error by the shell (such as
       command not found), then recovery will abort and the server will not start
  -    up.
  +    up. However, the server will also not start if the command returns a code
  +    of 128 and above.

It seems redundant with the explanation in this paragraph when you know that a
code greater than 125 is returned on shell error or signal.

As I'm sure you already know, this behavior is documented on the
archive_command side using these words:

«
When the archive command is terminated by a signal (other than SIGTERM that
is used as part of a server shutdown) or an error by the shell with an exit
status greater than 125 (such as command not found), or if the archive
function emits an ERROR or FATAL, the archiver process aborts and gets
restarted by the postmaster.
»

So I assume we could keep the same documentation style for the restore_command
side:

«
An exception is that if the command was terminated by a signal (other than
SIGTERM, which is used as part of a database server shutdown) or an error by
the shell **with an exit status greater than 125** (such as command not
found), then recovery will abort and the server will not start up.
»

What do you think?

#7Jean-Christophe Arnu
jcarnu@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#6)
1 attachment(s)
Re: restore_command return code behaviour

Hi Jehan-Guillaume,

Le lun. 28 juil. 2025 à 12:32, Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
a écrit :

command not found), then recovery will abort and the server will
not start
-    up.
+    up. However, the server will also not start if the command returns
a code
+    of 128 and above.

It seems redundant with the explanation in this paragraph when you know
that a
code greater than 125 is returned on shell error or signal.

You're right.

As I'm sure you already know, this behavior is documented on the
archive_command side using these words:

[...]

So I assume we could keep the same documentation style for the
restore_command
side:

«
An exception is that if the command was terminated by a signal (other
than
SIGTERM, which is used as part of a database server shutdown) or an
error by
the shell **with an exit status greater than 125** (such as command not
found), then recovery will abort and the server will not start up.
»

What do you think?

You're also right. That's more consistent and easier to read.
Thank you for pointing this out.

--
Jean-Christophe Arnu

Attachments:

v3-0001-add-details-on-restore-command-return-code-125.patchtext/x-patch; charset=US-ASCII; name=v3-0001-add-details-on-restore-command-return-code-125.patchDownload
From cbacdf8cf6809c03d714d1b968992938f3010e71 Mon Sep 17 00:00:00 2001
From: Jean-Christophe Arnu <jc.arnu@loxodata.com>
Date: Mon, 28 Jul 2025 17:04:37 +0200
Subject: [PATCH v3] add details on restore-command return code > 125

Documentation does not mention restore_command returns codes greater
than 125. If such a code is returned, the server is shut down.
Use the same wording than archive_command as advised Jeahn-Guillaume de
Rothais.
---
 doc/src/sgml/backup.sgml | 6 +++---
 doc/src/sgml/config.sgml | 5 +++--
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 5f7489afbd1..204e1c604f9 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1308,9 +1308,9 @@ restore_command = 'cp /mnt/server/archivedir/%f %p'
     present in the archive; it must return nonzero when so asked.  This is not
     an error condition.  An exception is that if the command was terminated by
     a signal (other than <systemitem>SIGTERM</systemitem>, which is used as
-    part of a database server shutdown) or an error by the shell (such as
-    command not found), then recovery will abort and the server will not start
-    up.
+    part of a database server shutdown) or an error by the shell with an exit
+    status greater than 125  (such as command not found), then recovery will
+    abort and the server will not start up.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 20ccb2d6b54..4af6c5934ae 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4056,8 +4056,9 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
 </programlisting>
         An exception is that if the command was terminated by a signal (other
         than <systemitem>SIGTERM</systemitem>, which is used as part of a
-        database server shutdown) or an error by the shell (such as command
-        not found), then recovery will abort and the server will not start up.
+        database server shutdown) or an error by the shell with an exit status
+        greater than 125 (such as command not found), then recovery will abort
+        and the server will not start up.
        </para>
 
        <para>
-- 
2.39.5

#8Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jean-Christophe Arnu (#7)
Re: restore_command return code behaviour

On Mon, Jul 28, 2025 at 8:19 AM Jean-Christophe Arnu <jcarnu@gmail.com> wrote:

You're also right. That's more consistent and easier to read.
Thank you for pointing this out.

I agree that reusing archive_command's wording is probably the way to
go. I think archive_cleanup_command and recovery_end_command have the
same issue; opinions on changing those as well?

Nitpick mode: I think the current wording for archive_command could be
misleading.

an error by the shell with an exit status greater than 125 (such as command not found)

The phrase "by the shell" is not really true here (that's kind of the
point of the thread, I'd argue) and I'm wondering if we should move
that to the parenthetical. But I don't like any of my draft ideas so
far, and I don't want to get in the way of a small improvement by
demanding perfection.

Thanks,
--Jacob

#9Jean-Christophe Arnu
jcarnu@gmail.com
In reply to: Jacob Champion (#8)
Re: restore_command return code behaviour

Le lun. 28 juil. 2025 à 20:15, Jacob Champion <
jacob.champion@enterprisedb.com> a écrit :

I agree that reusing archive_command's wording is probably the way to
go. I think archive_cleanup_command and recovery_end_command have the
same issue; opinions on changing those as well?

I agree, we could include both of those in the fix. But let's refine the
wording.

Nitpick mode: I think the current wording for archive_command could be
misleading.

an error by the shell with an exit status greater than 125 (such as

command not found)

The phrase "by the shell" is not really true here (that's kind of the
point of the thread, I'd argue) and I'm wondering if we should move
that to the parenthetical. But I don't like any of my draft ideas so
far, and I don't want to get in the way of a small improvement by
demanding perfection.

English is not my native language but let's try some wordings:

Proposals (scheme):

An exception is that if the command was terminated by a signal (other
than SIGTERM which is used as part of a database server shutdown)
or a command returned with an exit code greater than 125, or an error by
the shell
(such as command not found), then recovery will abort
and the server will not start up.

Or

The recovery will be aborted and the server will stop if any of the
following events occur:
- the command was terminated by a signal other than SIGTERM (which is used
as part of a database server shutdown);
- the command returns an exit code greater than 125
- the shell returns an error (such as 'command not found')

The former has a 'heavier' style; the latter has the benefit of clearly
showing each condition for shutting down the server (but it breaks the GUC
style, where bullet points are only used for defining possible values).

--
Jean-Christophe Arnu

#10Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jean-Christophe Arnu (#9)
Re: restore_command return code behaviour

On Mon, Jul 28, 2025 at 1:58 PM Jean-Christophe Arnu <jcarnu@gmail.com> wrote:

Or

The recovery will be aborted and the server will stop if any of the following events occur:
- the command was terminated by a signal other than SIGTERM (which is used as part of a database server shutdown);
- the command returns an exit code greater than 125
- the shell returns an error (such as 'command not found')

The former has a 'heavier' style; the latter has the benefit of clearly showing each condition for shutting down the server (but it breaks the GUC style, where bullet points are only used for defining possible values).

I like the latter. Riffing on that, we could collapse the bullet
points and reuse a bit of the current wording:

Recovery will abort and the server will not start up if any of the
following events occur: the command is terminated by a signal other
than SIGTERM (which is used as part of a database server shutdown);
the command returns an exit status greater than 125; or the shell
returns an error, such as "command not found".

WDYT?
--Jacob

#11David G. Johnston
david.g.johnston@gmail.com
In reply to: Jacob Champion (#10)
Re: restore_command return code behaviour

On Monday, July 28, 2025, Jacob Champion <jacob.champion@enterprisedb.com>
wrote:

Recovery will abort and the server will not start up if any of the
following events occur: the command is terminated by a signal other
than SIGTERM (which is used as part of a database server shutdown);
the command returns an exit status greater than 125; or the shell
returns an error, such as "command not found".

I don’t understand calling out sigterm as an exception, the same
abort-and-shutdown action happens there too. And in any case signals are
turned into exit status values anyway so naming them specifically seems
redundant.

The “Command not found” error is defined by POSIX as exit status 127 making
its specification redundant with > 125

“The server interprets the exit status per POSIX conventions: 0 for
success, and any value between 1-125 to gracefully handle the case where a
file name is supplied that does not exist, or exhibits a temporary failure
during processing. All other exit status values are considered fatal,
causing recovery to be aborted and the server stopped.”

David J.

#12Jacob Champion
jacob.champion@enterprisedb.com
In reply to: David G. Johnston (#11)
Re: restore_command return code behaviour

On Mon, Jul 28, 2025 at 2:42 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:

I don’t understand calling out sigterm as an exception, the same abort-and-shutdown action happens there too.

RestoreArchivedFile() has a special case for SIGTERM, though?

And in any case signals are turned into exit status values anyway so naming them specifically seems redundant.

The “Command not found” error is defined by POSIX as exit status 127 making its specification redundant with > 125

I don't think either is redundant from the point of view of the
targeted audience, who may not understand the overlap in the POSIX
specification. "My command returned", "my command died", and "my
command never ran" are interesting cases to have to consider (and I
think it's unfortunate that we can't reliably tell them apart).

--Jacob

#13David G. Johnston
david.g.johnston@gmail.com
In reply to: Jacob Champion (#12)
Re: restore_command return code behaviour

On Monday, July 28, 2025, Jacob Champion <jacob.champion@enterprisedb.com>
wrote:

On Mon, Jul 28, 2025 at 2:42 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:

I don’t understand calling out sigterm as an exception, the same

abort-and-shutdown action happens there too.

RestoreArchivedFile() has a special case for SIGTERM, though?

I don’t see anything calling out sigterm by name. It seems to be handled
the same as any other signal exit and, because of the “true” being passed
into wait_result_is_any_signal, identically to exit statuses 126-127 as
well.

And in any case signals are turned into exit status values anyway so

naming them specifically seems redundant.

Ok, yeah, the C API differentiates this reality since an unhandled signal
precludes an exit status from being created. I’m fine with “…unhandled
signals causing command termination, or exit statues > 125, result in abort
and stop”.

The “Command not found” error is defined by POSIX as exit status 127

making its specification redundant with > 125

I don't think either is redundant from the point of view of the
targeted audience, who may not understand the overlap in the POSIX
specification. "My command returned", "my command died", and "my
command never ran" are interesting cases to have to consider (and I
think it's unfortunate that we can't reliably tell them apart).

wait_result_to_str seems to be sufficient. Sure, the command author can
choose to violate conventions, but that’s life.

I don’t really see a point in describing the special meanings ascribed to
126 and 127 here since the error message will handle that should the need
arise.

David J.

#14Jean-Christophe Arnu
jcarnu@gmail.com
In reply to: David G. Johnston (#13)
Re: restore_command return code behaviour

Le mar. 29 juil. 2025 à 00:38, David G. Johnston <david.g.johnston@gmail.com>
a écrit :

On Monday, July 28, 2025, Jacob Champion <jacob.champion@enterprisedb.com>
wrote:

On Mon, Jul 28, 2025 at 2:42 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:

I don’t understand calling out sigterm as an exception, the same

abort-and-shutdown action happens there too.

RestoreArchivedFile() has a special case for SIGTERM, though?

I don’t see anything calling out sigterm by name. It seems to be handled
the same as any other signal exit and, because of the “true” being passed
into wait_result_is_any_signal, identically to exit statuses 126-127 as
well.

And in any case signals are turned into exit status values anyway so

naming them specifically seems redundant.

Ok, yeah, the C API differentiates this reality since an unhandled signal
precludes an exit status from being created. I’m fine with “…unhandled
signals causing command termination, or exit statues > 125, result in abort
and stop”.

Thank you!

Should we stick to the original wording, changing it slightly?

An exception is that if the command was terminated by an
unhandled signal causing command termination or an exit status
above 125 (including command not found), then recovery will abort
and the server will not start up.

Or should we use the previous proposal (in that thread) with some
adaptations?

Recovery will abort and the server will not start up if any of the
following events occur: the command is terminated by an unhandled
signal or the command returns an exit status greater than 125
(including "command not found").

--
Jean-Christophe Arnu

#15Jacob Champion
jacob.champion@enterprisedb.com
In reply to: David G. Johnston (#13)
Re: restore_command return code behaviour

On Mon, Jul 28, 2025 at 3:38 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:

On Monday, July 28, 2025, Jacob Champion <jacob.champion@enterprisedb.com> wrote:

RestoreArchivedFile() has a special case for SIGTERM, though?

I don’t see anything calling out sigterm by name.

It's got a comment explaining the separate behavior, right before the
call to wait_result_is_signal(rc, SIGTERM):

https://git.postgresql.org/cgit/postgresql.git/tree/src/backend/access/transam/xlogarchive.c?id=412036c22d6a605340dbe397da1fb12fccd3897f#n254

I don’t really see a point in describing the special meanings ascribed to 126 and 127 here since the error message will handle that should the need arise.

I don't think Jean-Christophe's stated problem is with the server's
error message (Jean-Christophe, please jump in and correct me) -- it's
in knowing how to design the command so that the system behaves
correctly. I'm not very excited about removing information at the same
time that we're solving a lack-of-information problem. (At least not
without consensus that the information is irrelevant, and personally I
think the cases described so far in this thread are relevant to people
writing these commands.)

Thanks,
--Jacob

#16David G. Johnston
david.g.johnston@gmail.com
In reply to: Jacob Champion (#15)
Re: restore_command return code behaviour

On Wednesday, July 30, 2025, Jacob Champion <jacob.champion@enterprisedb.com>
wrote:

On Mon, Jul 28, 2025 at 3:38 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:

On Monday, July 28, 2025, Jacob Champion <jacob.champion@enterprisedb.

com> wrote:

RestoreArchivedFile() has a special case for SIGTERM, though?

I don’t see anything calling out sigterm by name.

It's got a comment explaining the separate behavior, right before the
call to wait_result_is_signal(rc, SIGTERM):

https://git.postgresql.org/cgit/postgresql.git/tree/src/
backend/access/transam/xlogarchive.c?id=412036c22d6a605340dbe397da1fb1
2fccd3897f#n254

Oops, I was looking at fe_utils not xlogarchive.

I don’t really see a point in describing the special meanings ascribed

to 126 and 127 here since the error message will handle that should the
need arise.

I don't think Jean-Christophe's stated problem is with the server's
error message (Jean-Christophe, please jump in and correct me) -- it's
in knowing how to design the command so that the system behaves
correctly. I'm not very excited about removing information at the same
time that we're solving a lack-of-information problem. (At least not
without consensus that the information is irrelevant, and personally I
think the cases described so far in this thread are relevant to people
writing these commands.)

Works for me. I do see the value in pointing out “command could not be
executed” behavior without referring to exit status codes. But it does
make the text longer and maybe we could centralize it somewhere by
leveraging the reference design. But acknowledge that most of them will
not have the knowledge memorized and being a bit more verbose would be a
useful shortcut for them.

David J.

#17Jean-Christophe Arnu
jcarnu@gmail.com
In reply to: Jacob Champion (#15)
Re: restore_command return code behaviour

I don't think Jean-Christophe's stated problem is with the server's
error message (Jean-Christophe, please jump in and correct me) -- it's
in knowing how to design the command so that the system behaves
correctly. I'm not very excited about removing information at the same
time that we're solving a lack-of-information problem. (At least not
without consensus that the information is irrelevant, and personally I
think the cases described so far in this thread are relevant to people
writing these commands.)

That's right it has nothing to do with the error message. That proposal was
only intended to help any people with the server behaviour in such cases.

Not everybody reminds its POSIX return codes if not practiced. And it may
be hard (or take long) to understand why the server is stopped with a
simple scp returning 255 in some cases.

Just adding rhe "over 126 return code leads to server stop" (or so) to the
current doc version seems enough to me.

Jean-Christophe Arnu

#18David G. Johnston
david.g.johnston@gmail.com
In reply to: Jacob Champion (#10)
Re: restore_command return code behaviour

On Mon, Jul 28, 2025 at 2:22 PM Jacob Champion <
jacob.champion@enterprisedb.com> wrote:

On Mon, Jul 28, 2025 at 1:58 PM Jean-Christophe Arnu <jcarnu@gmail.com>
wrote:

Or

The recovery will be aborted and the server will stop if any of the

following events occur:

- the command was terminated by a signal other than SIGTERM (which is

used as part of a database server shutdown);

- the command returns an exit code greater than 125
- the shell returns an error (such as 'command not found')

The former has a 'heavier' style; the latter has the benefit of clearly

showing each condition for shutting down the server (but it breaks the GUC
style, where bullet points are only used for defining possible values).

I like the latter. Riffing on that, we could collapse the bullet
points and reuse a bit of the current wording:

Recovery will abort and the server will not start up if any of the
following events occur: the command is terminated by a signal other
than SIGTERM (which is used as part of a database server shutdown);
the command returns an exit status greater than 125; or the shell
returns an error, such as "command not found".

How about:

Recovery will abort and the server will not start up upon any of the
following:
the shell is unable to execute the command (due to it not being found or
executable),
the command returns an exit status greater than 125, or a non-SIGTERM signal
terminates the shell process. SIGTERM allows a clean shutdown to happen,
if there is one.

Mostly just changing the order a bit but goes from most immediate when
making a change (bad command written into the GUC) to least immediate or
surprising (SIGTERM). The other two flow from there.

David J.

#19Jean-Christophe Arnu
jcarnu@gmail.com
In reply to: David G. Johnston (#18)
Re: restore_command return code behaviour

Le lun. 4 août 2025, 04:45, David G. Johnston <david.g.johnston@gmail.com>
a écrit :

How about:

Recovery will abort and the server will not start up upon any of the
following:
the shell is unable to execute the command (due to it not being found or
executable),
the command returns an exit status greater than 125, or a non-SIGTERM
signal
terminates the shell process. SIGTERM allows a clean shutdown to happen,
if there is one.

Mostly just changing the order a bit but goes from most immediate when
making a change (bad command written into the GUC) to least immediate or
surprising (SIGTERM). The other two flow from there.

I agree. I like this concise way to explain the different cases.