Failure with 004_logrotate in prairiedog

Started by Michael Paquierover 4 years ago9 messages
#1Michael Paquier
michael@paquier.xyz

Hi all,

prairiedog has failed in a way that seems a bit obscure to me:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-07-18%2000%3A23%3A29

Here are the details of the failure:
server signaled to rotate log file
could not read
"/Users/buildfarm/bf-data/HEAD/pgsql.build/src/bin/pg_ctl/tmp_check/t_004_logrotate_primary_data/pgdata/current_logfiles":
No such file or directory at t/004_logrotate.pl line 78
### Stopping node "primary" using mode immediate

update_metainfo_datafile() creates a temporary file renamed to
current_logfiles with rename(). It should be atomic, though this
error points out that this is not the case? The previous steps of
this test ensure that current_logfiles should exist.

We could use some eval blocks in this area, but a non-atomic rename()
would cause problems in more areas. Thoughts?
--
Michael

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#1)
Re: Failure with 004_logrotate in prairiedog

Michael Paquier <michael@paquier.xyz> writes:

prairiedog has failed in a way that seems a bit obscure to me:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&amp;dt=2021-07-18%2000%3A23%3A29
...
We could use some eval blocks in this area, but a non-atomic rename()
would cause problems in more areas. Thoughts?

Awhile back we discovered that old macOS versions have non-atomic rename
[1]: /messages/by-id/2636.1569016167@sss.pgh.pa.us
often enough to be annoying. I'd not seen such a failure before on
prairiedog, but it sure seems plausible that this is one.

regards, tom lane

[1]: /messages/by-id/2636.1569016167@sss.pgh.pa.us

#3Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#2)
Re: Failure with 004_logrotate in prairiedog

On Sun, Jul 18, 2021 at 01:42:18AM -0400, Tom Lane wrote:

Awhile back we discovered that old macOS versions have non-atomic rename
[1]. I eventually shut down dromedary because that was causing failures
often enough to be annoying. I'd not seen such a failure before on
prairiedog, but it sure seems plausible that this is one.

Thanks for the pointer. This indeed looks like the same problem.
--
Michael

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#3)
Re: Failure with 004_logrotate in prairiedog

Michael Paquier <michael@paquier.xyz> writes:

On Sun, Jul 18, 2021 at 01:42:18AM -0400, Tom Lane wrote:

Awhile back we discovered that old macOS versions have non-atomic rename
[1]. I eventually shut down dromedary because that was causing failures
often enough to be annoying. I'd not seen such a failure before on
prairiedog, but it sure seems plausible that this is one.

Thanks for the pointer. This indeed looks like the same problem.

For context, dromedary (now florican) is a dual-CPU machine while
prairiedog has but one CPU. I'd thought that maybe not being
multi-CPU insulated prairiedog from the non-atomic-rename problem.
But now it looks like it's merely a lot less probable on that hardware.

regards, tom lane

#5Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#1)
Re: Failure with 004_logrotate in prairiedog

At Sun, 18 Jul 2021 12:32:20 +0900, Michael Paquier <michael@paquier.xyz> wrote in

Hi all,

prairiedog has failed in a way that seems a bit obscure to me:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&amp;dt=2021-07-18%2000%3A23%3A29

Here are the details of the failure:
server signaled to rotate log file
could not read
"/Users/buildfarm/bf-data/HEAD/pgsql.build/src/bin/pg_ctl/tmp_check/t_004_logrotate_primary_data/pgdata/current_logfiles":
No such file or directory at t/004_logrotate.pl line 78
### Stopping node "primary" using mode immediate

update_metainfo_datafile() creates a temporary file renamed to
current_logfiles with rename(). It should be atomic, though this
error points out that this is not the case? The previous steps of
this test ensure that current_logfiles should exist.

We could use some eval blocks in this area, but a non-atomic rename()
would cause problems in more areas. Thoughts?

PostgresNode.logrotate() just invokes pg_ctl logrotate, which ends
with triggering log rotation by a signal.

When rotation happens, the metainfo file is once removed then
created. If slurp_file in the metafile-checking loop hits the gap, the
slurp_file fails with ENOENT.

For non-win32 platforms, the error is identifiable by #!{ENOENT} but
I'm not sure how we can identify the error for createFile(). $!
doesn't work, and $^E returns a human-readable string in the platform
language..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#6Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#5)
Re: Failure with 004_logrotate in prairiedog

On Mon, Jul 19, 2021 at 04:15:36PM +0900, Kyotaro Horiguchi wrote:

When rotation happens, the metainfo file is once removed then
created. If slurp_file in the metafile-checking loop hits the gap, the
slurp_file fails with ENOENT.

I can read the following code, as of update_metainfo_datafile():
if (rename(LOG_METAINFO_DATAFILE_TMP, LOG_METAINFO_DATAFILE) != 0)
ereport(LOG,
(errcode_for_file_access(),
errmsg("could not rename file \"%s\" to \"%s\": %m",
LOG_METAINFO_DATAFILE_TMP, LOG_METAINFO_DATAFILE)));

This creates a temporary file that gets renamed to current_logfiles.
--
Michael

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kyotaro Horiguchi (#5)
Re: Failure with 004_logrotate in prairiedog

Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:

When rotation happens, the metainfo file is once removed then
created. If slurp_file in the metafile-checking loop hits the gap, the
slurp_file fails with ENOENT.

Oh! Yeah, that's dumb, we should fix it to use rename(). Can't blame
platform's rename() if it's not being used.

regards, tom lane

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#6)
Re: Failure with 004_logrotate in prairiedog

Michael Paquier <michael@paquier.xyz> writes:

On Mon, Jul 19, 2021 at 04:15:36PM +0900, Kyotaro Horiguchi wrote:

When rotation happens, the metainfo file is once removed then
created. If slurp_file in the metafile-checking loop hits the gap, the
slurp_file fails with ENOENT.

I can read the following code, as of update_metainfo_datafile():
if (rename(LOG_METAINFO_DATAFILE_TMP, LOG_METAINFO_DATAFILE) != 0)

Yeah, ignore my previous message. There is an unlink up at the top
of the function, which fooled me in my caffeine-deprived state.
But that path is only taken when logging was just turned off, so
we must remove the now-irrelevant metafile.

regards, tom lane

#9Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#8)
Re: Failure with 004_logrotate in prairiedog

At Mon, 19 Jul 2021 10:23:46 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

Michael Paquier <michael@paquier.xyz> writes:

On Mon, Jul 19, 2021 at 04:15:36PM +0900, Kyotaro Horiguchi wrote:

When rotation happens, the metainfo file is once removed then
created. If slurp_file in the metafile-checking loop hits the gap, the
slurp_file fails with ENOENT.

I can read the following code, as of update_metainfo_datafile():
if (rename(LOG_METAINFO_DATAFILE_TMP, LOG_METAINFO_DATAFILE) != 0)

Yeah, ignore my previous message. There is an unlink up at the top
of the function, which fooled me in my caffeine-deprived state.

Yeah, sorry for the stupidity.

But that path is only taken when logging was just turned off, so
we must remove the now-irrelevant metafile.

I'm not sure this is relevant, I found the following article. (as a
token of my apology:p)

http://www.weirdnet.nl/apple/rename.html

There is an easy way to empirically prove that rename() is not
atomic on Leopard 10.5.2. All you have to do is create a link to a
directory, replace that link with a

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center