bad logging around broken restore_command

Started by Jeff Janesalmost 6 years ago7 messages
#1Jeff Janes
jeff.janes@gmail.com
1 attachment(s)

If the restore command claims to have succeeded, but fails to have created
a file with the right name (due to typos or escaping or quoting issues, for
example), no complaint is issued at the time, and it then fails later with
a relatively uninformative error message like "could not locate required
checkpoint record".

if (rc == 0)
{
/*
* command apparently succeeded, but let's make sure the file is
* really there now and has the correct size.
*/
if (stat(xlogpath, &stat_buf) == 0)
{......
}
else
{
/* stat failed */
if (errno != ENOENT)
ereport(FATAL,
(errcode_for_file_access(),
errmsg("could not stat file \"%s\": %m",
xlogpath)));
}

I don't see why ENOENT is thought to deserve the silent treatment. It
seems weird that success gets logged ("restored log file \"%s\" from
archive"), but one particular type of unexpected failure does not.

I've attached a patch which emits a LOG message for ENOENT. The exact
wording doesn't matter to me, I'm sure someone can improve it.
Alternatively, perhaps the message a few lines down, "could not restore
file", could get promoted from DEBUG2 to LOG when rc indicates success.
But I don't think we need any more messages which say "Something went
wrong: success".

This is based on the question at
https://stackoverflow.com/questions/60056597/couldnt-restore-postgres-v11-from-pg-basebackup.
But this isn' the first time I've seen similar confusion.

Cheers,

Jeff

Attachments:

restore_log.patchapplication/octet-stream; name=restore_log.patchDownload
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 188b73e..061f1e8 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -265,7 +265,11 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 		else
 		{
 			/* stat failed */
-			if (errno != ENOENT)
+			if (errno == ENOENT)
+				ereport(LOG,
+						 (errmsg("Restore command reported success, but did not create the requested file \"%s\"",
+								xlogpath)));
+			else
 				ereport(FATAL,
 						(errcode_for_file_access(),
 						 errmsg("could not stat file \"%s\": %m",
#2Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Jeff Janes (#1)
1 attachment(s)
Re: bad logging around broken restore_command

On 2020/02/06 1:10, Jeff Janes wrote:

If the restore command claims to have succeeded, but fails to have created a file with the right name (due to typos or escaping or quoting issues, for example), no complaint is issued at the time, and it then fails later with a relatively uninformative error message like "could not locate required checkpoint record".

    if (rc == 0)
    {
        /*
         * command apparently succeeded, but let's make sure the file is
         * really there now and has the correct size.
         */
        if (stat(xlogpath, &stat_buf) == 0)
        {......
        }
        else
        {
            /* stat failed */
            if (errno != ENOENT)
                ereport(FATAL,
                        (errcode_for_file_access(),
                         errmsg("could not stat file \"%s\": %m",
                                xlogpath)));
        }

I don't see why ENOENT is thought to deserve the silent treatment.  It seems weird that success gets logged ("restored log file \"%s\" from archive"), but one particular type of unexpected failure does not.

Agreed.

I've attached a patch which emits a LOG message for ENOENT.

Isn't it better to use "could not stat file" message even in ENOENT case?
If yes, something like message that you used in the patch should be
logged as DETAIL or HINT message. So, what about the attached patch?

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

Attachments:

restore_log_fujii.patchtext/plain; charset=UTF-8; name=restore_log_fujii.patch; x-mac-creator=0; x-mac-type=0Download
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 188b73e752..0a28b7a562 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -265,11 +265,11 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 		else
 		{
 			/* stat failed */
-			if (errno != ENOENT)
-				ereport(FATAL,
-						(errcode_for_file_access(),
-						 errmsg("could not stat file \"%s\": %m",
-								xlogpath)));
+			int	elevel = (errno == ENOENT) ? LOG : FATAL;
+			ereport(elevel,
+					(errcode_for_file_access(),
+					 errmsg("could not stat file \"%s\": %m", xlogpath),
+					 errdetail("restore_command returned a zero exit status, but stat() failed.")));
 		}
 	}
 
#3David Steele
david@pgmasters.net
In reply to: Fujii Masao (#2)
Re: bad logging around broken restore_command

Hi Jeff,

On 2/6/20 9:23 AM, Fujii Masao wrote:

I've attached a patch which emits a LOG message for ENOENT.

Isn't it better to use "could not stat file" message even in ENOENT case?
If yes, something like message that you used in the patch should be
logged as DETAIL or HINT message. So, what about the attached patch?

What do you think of Fujii's changes?

Regards,
--
-David
david@pgmasters.net

#4Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jeff Janes (#1)
Re: bad logging around broken restore_command

At Thu, 6 Feb 2020 23:23:42 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2020/02/06 1:10, Jeff Janes wrote:

If the restore command claims to have succeeded, but fails to have created
a file with the right name (due to typos or escaping or quoting issues, for
example), no complaint is issued at the time, and it then fails later with
a relatively uninformative error message like "could not locate required
checkpoint record".

...

I don't see why ENOENT is thought to deserve the silent treatment.  It
seems weird that success gets logged ("restored log file \"%s\" from
archive"), but one particular type of unexpected failure does not.

Agreed.

In the first place it is not perfectly silent and that problem cannot
happen. In the ENOENT case, the function reports "could not restore
file \"%s\" from archive: %s", but with DEBUG2 then returns false, and
the callers treat the failure properly.

I've attached a patch which emits a LOG message for ENOENT.

Isn't it better to use "could not stat file" message even in ENOENT
case?
If yes, something like message that you used in the patch should be
logged as DETAIL or HINT message. So, what about the attached patch?

If you want to see some log messages in the case, it is sufficient to
raise the loglevel of the existing message from DEBUG2 to LOG.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Kyotaro Horiguchi (#4)
Re: bad logging around broken restore_command

On 2020/03/10 11:47, Kyotaro Horiguchi wrote:

At Thu, 6 Feb 2020 23:23:42 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2020/02/06 1:10, Jeff Janes wrote:

If the restore command claims to have succeeded, but fails to have created
a file with the right name (due to typos or escaping or quoting issues, for
example), no complaint is issued at the time, and it then fails later with
a relatively uninformative error message like "could not locate required
checkpoint record".

...

I don't see why ENOENT is thought to deserve the silent treatment.  It
seems weird that success gets logged ("restored log file \"%s\" from
archive"), but one particular type of unexpected failure does not.

Agreed.

In the first place it is not perfectly silent and that problem cannot
happen. In the ENOENT case, the function reports "could not restore
file \"%s\" from archive: %s", but with DEBUG2 then returns false, and
the callers treat the failure properly.

Yes.

I've attached a patch which emits a LOG message for ENOENT.

Isn't it better to use "could not stat file" message even in ENOENT
case?
If yes, something like message that you used in the patch should be
logged as DETAIL or HINT message. So, what about the attached patch?

If you want to see some log messages in the case, it is sufficient to
raise the loglevel of the existing message from DEBUG2 to LOG.

Isn't it too noisy to log every time when we could not restore
the archived file? In archive recovery case, it's common to fail
to restore archive files and try to replay WAL files in pg_wal.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

#6Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Fujii Masao (#5)
Re: bad logging around broken restore_command

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested

I decided to add my review to this simple patch.
I applied Fuji's patch and found it perfectly working with installcheck-world passed, code is clean.
As for the feature I agree with Jeff and Fuji that this ENOENT case is worth logging under LOG priority.
I consider the last (Fuji's) patch is now ready to be committed.

Best regards,
Pavel Borisov

The new status of this patch is: Ready for Committer

#7Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Pavel Borisov (#6)
Re: bad logging around broken restore_command

On 2020/11/19 20:27, Pavel Borisov wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested

I decided to add my review to this simple patch.
I applied Fuji's patch and found it perfectly working with installcheck-world passed, code is clean.
As for the feature I agree with Jeff and Fuji that this ENOENT case is worth logging under LOG priority.
I consider the last (Fuji's) patch is now ready to be committed.

Thanks for the review! Pushed.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION