restore-command error handling

Started by Sebastiaan Mannem over 4 years ago, 1 message, general
#1 Sebastiaan Mannem
sebas@mannem.nl

Hi,

this should probably be for pgsql-hackers, but https://www.postgresql.org/list/ mentioned 'You must try elsewhere first!', and this list was second best...

I wanted to point you to this github issue:

https://github.com/wal-g/wal-g/issues/1126

Basically, Postgres only distinguishes 3 classes of return codes from a restore_command:

0: No problem, next WAL file...

1 - 125: End of timeline? OK, let's stop recovery and go online

>= 126: Ouch, big problem. Better not proceed, but error out with a FAIL instead
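
To make the three classes concrete, here is a minimal Python sketch of the mapping as I understand it. The names RecoveryAction and classify_exit_code are my own, made up for illustration; this is not Postgres code, only the behaviour described above written down.

```python
from enum import Enum


class RecoveryAction(Enum):
    """Hypothetical labels for the three outcomes listed above."""
    CONTINUE = "WAL file restored, ask for the next one"
    END_OF_ARCHIVE = "file treated as unavailable, stop recovery and go online"
    ABORT = "hard failure, error out instead of going online"


def classify_exit_code(code: int) -> RecoveryAction:
    """Map a restore_command exit code to the class it falls into."""
    if not 0 <= code <= 255:
        raise ValueError(f"not a valid exit code: {code}")
    if code == 0:
        return RecoveryAction.CONTINUE
    if code <= 125:
        return RecoveryAction.END_OF_ARCHIVE
    # 126 and up: the range the shell also uses for its own errors
    return RecoveryAction.ABORT
```

Note that every code from 1 through 125 lands in the same class, so a restore_command has no way to say "hard failure" without borrowing a code from the shell's range.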

Looking at https://tldp.org/LDP/abs/html/exitcodes.html, exit codes beyond 125 are all OS-related.

For example, 126 means 'Permission problem or command is not an executable', and 130 means 'Control-C is fatal error signal 2'.

I would think that an exit code like 78 would be a better choice for restore_command errors which are not OS-related, but which should still end in 'Ouch, big problem. Better not proceed, but error out with a FAIL instead'.

I think I will work on a fix for wal-g so that it distinguishes better between exit codes, but since all I can currently do is exit with a code >= 126, I wanted to bring this to the Postgres community too.
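
As a rough sketch of what such a workaround could look like today, here is a small Python wrapper around a fetch command. Everything in it is an assumption for illustration: run_restore, ABORT_RECOVERY and missing_codes are made-up names, and I am assuming the wrapped tool uses exit code 1 for "file not in the archive", which a real tool may not do.

```python
import subprocess

# Lowest code in the range that makes Postgres error out (see the list above).
ABORT_RECOVERY = 126


def run_restore(fetch_cmd: list[str],
                missing_codes: frozenset[int] = frozenset({1})) -> int:
    """Run fetch_cmd and translate its result into a restore_command exit code.

    missing_codes is a hypothetical set of exit codes that the wrapped tool
    uses for "this WAL file is not in the archive".
    """
    try:
        result = subprocess.run(fetch_cmd)
    except OSError:
        # The command could not be started at all.
        return ABORT_RECOVERY
    if result.returncode == 0:
        return 0
    if result.returncode in missing_codes:
        # Legitimate end of archive: let recovery finish and go online.
        return 1
    # Any other failure (including death by signal, which subprocess
    # reports as a negative return code) should stop recovery.
    return ABORT_RECOVERY
```

The catch is exactly the one I am describing: the only "stop recovery" code the wrapper can return is one that the shell also uses for its own errors.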

Furthermore, this goes beyond wal-g; it applies to basically everything that runs as a restore_command...

Would you consider adding another exit code to the list, so that restore_commands don't need to exit with error codes that were meant to signal OS-level issues?

I wanted to end with this quote from the second link I pointed to:

Ending a script with exit 127 would certainly cause confusion when troubleshooting (is the error code a "command not found" or a user-defined one?).

However, many scripts use an exit 1 as a general bailout-upon-error.

Since exit code 1 signifies so many possible errors, it is not particularly useful in debugging.

To me, that holds not just for 127, but for all exit codes beyond 125...

Thanks.