"pg_ctl: the PID file ... is empty" at end of make check

Started by Thomas Munroover 7 years ago15 messageshackers
Jump to latest
#1Thomas Munro
thomas.munro@gmail.com

Hello,

Today I saw a one-off case of $SUBJECT, on macOS. I can't reproduce
it, but I noticed exactly the same thing on longfin the other day:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-25%2005%3A39%3A04

Anyone know what that's about?

--
Thomas Munro
http://www.enterprisedb.com

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#1)
Re: "pg_ctl: the PID file ... is empty" at end of make check

Thomas Munro <thomas.munro@enterprisedb.com> writes:

Today I saw a one-off case of $SUBJECT, on macOS. I can't reproduce
it, but I noticed exactly the same thing on longfin the other day:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&amp;dt=2018-11-25%2005%3A39%3A04
Anyone know what that's about?

No :-(. I was wondering about that longfin failure too. What macOS
version are you running?

(longfin is up to date, 10.14.1. Is it worth noting that I jacked
up the name and drove new hardware underneath it just last week?
Hard to see how that could be related, but ...)

regards, tom lane

#3Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#2)
Re: "pg_ctl: the PID file ... is empty" at end of make check

On Wed, Nov 28, 2018 at 4:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

Today I saw a one-off case of $SUBJECT, on macOS. I can't reproduce
it, but I noticed exactly the same thing on longfin the other day:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&amp;dt=2018-11-25%2005%3A39%3A04
Anyone know what that's about?

No :-(. I was wondering about that longfin failure too. What macOS
version are you running?

(longfin is up to date, 10.14.1. Is it worth noting that I jacked
up the name and drove new hardware underneath it just last week?
Hard to see how that could be related, but ...)

I have not yet accepted the offer to update to 10.14.1, so I guess it
must be 10.14.0, though the about box just says 10.14. According to
sw_vers, it's 10.14 build 18A391.

--
Thomas Munro
http://www.enterprisedb.com

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#1)
Re: "pg_ctl: the PID file ... is empty" at end of make check

Thomas Munro <thomas.munro@enterprisedb.com> writes:

Today I saw a one-off case of $SUBJECT, on macOS. I can't reproduce
it, but I noticed exactly the same thing on longfin the other day:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&amp;dt=2018-11-25%2005%3A39%3A04

I trawled the buildfarm logs and discovered a second instance of exactly
the same thing:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&amp;dt=2018-11-19%2018%3A37%3A00

There have not been any other occurrences in the past 3 months, which is
as far back as I went. (lorikeet has half a dozen occurrences of "could
not stop postmaster", which is what I was grepping for, but they all
are associated with that machine's intermittent postmaster crashes.)

So that lets out the flaky-hardware theory: that occurrence is before
longfin's hardware transplant.

Also, I don't think I believe the OS-bug idea either, given that you
saw it on 10.14.0. longfin's been running 10.14.something since
2018-09-26, and has accumulated circa 200 runs since then just on HEAD,
never mind the back branches. It'd be pretty unlikely to see it only
in the past week, and only on HEAD, if it were an OS bug introduced two
months ago.

So my theory is we broke something in HEAD a couple weeks ago. But what?

The fsync changes you made are suspiciously close to this issue (ie one
could explain it as written data not getting out), and were committed in
the right time frame, but that change didn't affect writes to
postmaster.pid did it?

regards, tom lane

#5Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#4)
Re: "pg_ctl: the PID file ... is empty" at end of make check

On Wed, Nov 28, 2018 at 5:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

Today I saw a one-off case of $SUBJECT, on macOS. I can't reproduce
it, but I noticed exactly the same thing on longfin the other day:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&amp;dt=2018-11-25%2005%3A39%3A04

I trawled the buildfarm logs and discovered a second instance of exactly
the same thing:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&amp;dt=2018-11-19%2018%3A37%3A00

There have not been any other occurrences in the past 3 months, which is
as far back as I went. (lorikeet has half a dozen occurrences of "could
not stop postmaster", which is what I was grepping for, but they all
are associated with that machine's intermittent postmaster crashes.)

So that lets out the flaky-hardware theory: that occurrence is before
longfin's hardware transplant.

Also, I don't think I believe the OS-bug idea either, given that you
saw it on 10.14.0. longfin's been running 10.14.something since
2018-09-26, and has accumulated circa 200 runs since then just on HEAD,
never mind the back branches. It'd be pretty unlikely to see it only
in the past week, and only on HEAD, if it were an OS bug introduced two
months ago.

Yeah, it'd be slightly easier to believe when High Sierra first came
out and every hfs+ volume was silently migrated to the brand new apfs.
But yeah, that idea seems like a long shot at this point.

So my theory is we broke something in HEAD a couple weeks ago. But what?

Hmm. Not seeing it. I'm trying to do it again, with a make check loop.

The fsync changes you made are suspiciously close to this issue (ie one
could explain it as written data not getting out), and were committed in
the right time frame, but that change didn't affect writes to
postmaster.pid did it?

Commit 9ccdd7f6 doesn't affect writes to anything. It just changes
the elevel if certain fsync calls fail (and incidentally none near
this code, and in any case there was no failure).

--
Thomas Munro
http://www.enterprisedb.com

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#5)
Re: "pg_ctl: the PID file ... is empty" at end of make check

Thomas Munro <thomas.munro@enterprisedb.com> writes:

On Wed, Nov 28, 2018 at 5:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

So my theory is we broke something in HEAD a couple weeks ago. But what?

Hmm. Not seeing it. I'm trying to do it again, with a make check loop.

The fsync changes you made are suspiciously close to this issue (ie one
could explain it as written data not getting out), and were committed in
the right time frame, but that change didn't affect writes to
postmaster.pid did it?

Commit 9ccdd7f6 doesn't affect writes to anything. It just changes
the elevel if certain fsync calls fail (and incidentally none near
this code, and in any case there was no failure).

Yeah ... the other weird thing about this is that the postmaster log
shows normal shutdown.

Is it possible that unlink() on APFS is not atomic? That is, the
sequence of events is something like

pg_ctl: open("postmaster.pid")
postmaster: unlink("postmaster.pid")
pg_ctl: reads file, gets zero bytes

But that idea is in the OS-bug class, which we agree seems unlikely.

regards, tom lane

#7Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#6)
Re: "pg_ctl: the PID file ... is empty" at end of make check

On Wed, Nov 28, 2018 at 6:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Is it possible that unlink() on APFS is not atomic? That is, the
sequence of events is something like

pg_ctl: open("postmaster.pid")
postmaster: unlink("postmaster.pid")
pg_ctl: reads file, gets zero bytes

But that idea is in the OS-bug class, which we agree seems unlikely.

I think you might be right.

https://github.com/macdice/unlinktest

--
Thomas Munro
http://www.enterprisedb.com

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#7)
Re: "pg_ctl: the PID file ... is empty" at end of make check

Thomas Munro <thomas.munro@enterprisedb.com> writes:

On Wed, Nov 28, 2018 at 6:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Is it possible that unlink() on APFS is not atomic?

I think you might be right.
https://github.com/macdice/unlinktest

Bleah. But you can do better than ask whether it's a bug: you can
quote POSIX:

The unlink() function shall remove a link to a file. If path names a
symbolic link, unlink() shall remove the symbolic link named by path
and shall not affect any file or directory named by the contents of
the symbolic link. Otherwise, unlink() shall remove the link named by
the pathname pointed to by path and shall decrement the link count of
the file referenced by the link.

When the file's link count becomes 0 and no process has the file open,
the space occupied by the file shall be freed and the file shall no
longer be accessible. If one or more processes have the file open when
the last link is removed, the link shall be removed before unlink()
returns, but the removal of the file contents shall be postponed until
all references to the file are closed.

Not a lot of wiggle room there.

regards, tom lane

#9Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#8)
Re: "pg_ctl: the PID file ... is empty" at end of make check

On Thu, Nov 29, 2018 at 3:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

On Wed, Nov 28, 2018 at 6:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Is it possible that unlink() on APFS is not atomic?

I think you might be right.
https://github.com/macdice/unlinktest

Bleah. But you can do better than ask whether it's a bug: you can
quote POSIX:

The unlink() function shall remove a link to a file. If path names a
symbolic link, unlink() shall remove the symbolic link named by path
and shall not affect any file or directory named by the contents of
the symbolic link. Otherwise, unlink() shall remove the link named by
the pathname pointed to by path and shall decrement the link count of
the file referenced by the link.

When the file's link count becomes 0 and no process has the file open,
the space occupied by the file shall be freed and the file shall no
longer be accessible. If one or more processes have the file open when
the last link is removed, the link shall be removed before unlink()
returns, but the removal of the file contents shall be postponed until
all references to the file are closed.

Not a lot of wiggle room there.

Agreed. Secret non-shareable bug report filed. Fingers crossed.

--
Thomas Munro
http://www.enterprisedb.com

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#9)
Re: "pg_ctl: the PID file ... is empty" at end of make check

[ blast from the past dept. ]

Thomas Munro <thomas.munro@enterprisedb.com> writes:

On Thu, Nov 29, 2018 at 3:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

https://github.com/macdice/unlinktest

Bleah. But you can do better than ask whether it's a bug: you can
quote POSIX:
...
Not a lot of wiggle room there.

Agreed. Secret non-shareable bug report filed. Fingers crossed.

Since that conversation, longfin has shown the same symptom
just once more:

longfin | REL_11_STABLE | 2019-07-28 22:29:03 | recoveryCheck | waiting for ser
ver to shut down......pg_ctl: the PID file "/Users/buildfarm/bf-data/REL_11_STAB
LE/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdat
a/postmaster.pid" is empty

and now prairiedog has shown it too:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&amp;dt=2019-10-14%2021%3A45%3A47

which is positively fascinating, because prairiedog is running a
bronze-age version of macOS that surely never heard of APFS.
So this makes it look like this is a basic macOS bug that's not
as filesystem-dependent as one might think.

Trawling the buildfarm database finds no other matches in the last year,
so whatever it is, it's pretty darn improbable.

Did you ever get any reaction to that bug report?

regards, tom lane

#11Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#10)
Re: "pg_ctl: the PID file ... is empty" at end of make check

On Tue, Oct 15, 2019 at 1:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

Agreed. Secret non-shareable bug report filed. Fingers crossed.

Since that conversation, longfin has shown the same symptom
just once more:

longfin | REL_11_STABLE | 2019-07-28 22:29:03 | recoveryCheck | waiting for ser
ver to shut down......pg_ctl: the PID file "/Users/buildfarm/bf-data/REL_11_STAB
LE/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdat
a/postmaster.pid" is empty

and now prairiedog has shown it too:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&amp;dt=2019-10-14%2021%3A45%3A47

which is positively fascinating, because prairiedog is running a
bronze-age version of macOS that surely never heard of APFS.
So this makes it look like this is a basic macOS bug that's not
as filesystem-dependent as one might think.

Does https://github.com/macdice/unlinktest show the problem on that system?

Trawling the buildfarm database finds no other matches in the last year,
so whatever it is, it's pretty darn improbable.

Did you ever get any reaction to that bug report?

No acknowledgement. I did learn from an anonymous source that "it’s
prioritized and in the queue".

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#11)
Re: "pg_ctl: the PID file ... is empty" at end of make check

Thomas Munro <thomas.munro@gmail.com> writes:

On Tue, Oct 15, 2019 at 1:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

and now prairiedog has shown it too:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&amp;dt=2019-10-14%2021%3A45%3A47
which is positively fascinating, because prairiedog is running a
bronze-age version of macOS that surely never heard of APFS.
So this makes it look like this is a basic macOS bug that's not
as filesystem-dependent as one might think.

Does https://github.com/macdice/unlinktest show the problem on that system?

It does, though with a very low frequency:

$ ./unlinktest
$ ./unlinktest 10000
read 0 bytes, unexpected
$ ./unlinktest 10000
read 0 bytes, unexpected
read 0 bytes, unexpected
$ ./unlinktest 10000
$

The failure rate on my recent-vintage laptop is more like one
failure every five loops.

regards, tom lane

#13Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#12)
Re: "pg_ctl: the PID file ... is empty" at end of make check

On Fri, Oct 18, 2019 at 1:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Tue, Oct 15, 2019 at 1:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

and now prairiedog has shown it too:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&amp;dt=2019-10-14%2021%3A45%3A47
which is positively fascinating, because prairiedog is running a
bronze-age version of macOS that surely never heard of APFS.
So this makes it look like this is a basic macOS bug that's not
as filesystem-dependent as one might think.

Does https://github.com/macdice/unlinktest show the problem on that system?

It does, though with a very low frequency:

$ ./unlinktest
$ ./unlinktest 10000
read 0 bytes, unexpected
$ ./unlinktest 10000
read 0 bytes, unexpected
read 0 bytes, unexpected
$ ./unlinktest 10000
$

The failure rate on my recent-vintage laptop is more like one
failure every five loops.

Wow. Ok, I'll add a note to the bug report to say it's reproducible
on "Darwin Kernel Version 8.11.0: Wed Oct 10 18:26:00 PDT 2007;
root:xnu-792.24.17~1/RELEASE_PPC" next time I'm near an Apple device
that will let me log into the bug thing. On the off-chance that
someone from Apple stumbles on this and is interested, the Radar
number is rdar://46318934 and the title is "unlink(2) is not atomic
(kernel/file system bug)".

#14Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#13)
Re: "pg_ctl: the PID file ... is empty" at end of make check

On Fri, Oct 18, 2019 at 2:26 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Oct 18, 2019 at 1:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Tue, Oct 15, 2019 at 1:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

and now prairiedog has shown it too:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&amp;dt=2019-10-14%2021%3A45%3A47
which is positively fascinating, because prairiedog is running a
bronze-age version of macOS that surely never heard of APFS.
So this makes it look like this is a basic macOS bug that's not
as filesystem-dependent as one might think.

Does https://github.com/macdice/unlinktest show the problem on that system?

It does, though with a very low frequency:

$ ./unlinktest
$ ./unlinktest 10000
read 0 bytes, unexpected
$ ./unlinktest 10000
read 0 bytes, unexpected
read 0 bytes, unexpected
$ ./unlinktest 10000
$

The failure rate on my recent-vintage laptop is more like one
failure every five loops.

Wow. Ok, I'll add a note to the bug report to say it's reproducible
on "Darwin Kernel Version 8.11.0: Wed Oct 10 18:26:00 PDT 2007;
root:xnu-792.24.17~1/RELEASE_PPC" next time I'm near an Apple device
that will let me log into the bug thing. On the off-chance that
someone from Apple stumbles on this and is interested, the Radar
number is rdar://46318934 and the title is "unlink(2) is not atomic
(kernel/file system bug)".

My bug report got an automated-looking message telling me to retest in
Big Sur beta 6 back in September, and I've just now upgraded an old
x86 Mac to Big Sur 11.01 and I can no longer reproduce the problem, so
it looks like it was fixed! Thanks, Apple.

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#14)
Re: "pg_ctl: the PID file ... is empty" at end of make check

Thomas Munro <thomas.munro@gmail.com> writes:

My bug report got an automated-looking message telling me to retest in
Big Sur beta 6 back in September, and I've just now upgraded an old
x86 Mac to Big Sur 11.01 and I can no longer reproduce the problem, so
it looks like it was fixed! Thanks, Apple.

Hah! I've occasionally despaired that Apple actually reads anything
I file there, so this is glad tidings indeed.

regards, tom lane