CreateLockFile() race condition

Started by Noah Mischover 13 years ago4 messages
#1Noah Misch
noah@leadboat.com

Despite its efforts, CreateLockFile() does not reliably prevent multiple
closely-timed postmasters from completing startup on the same data directory.
This test procedure evades its defenses:

1. kill -9 the running postmaster for $PGDATA
2. "mkdir /tmp/sock1 /tmp/sock2"
3. "gdb postgres", "b unlink", "run -k /tmp/sock1". Hits breakpoint.
4. "postgres -k /tmp/sock2"
5. In gdb: "d 1", "c".

The use of two socket directories is probably not essential, but it makes the
test case simpler.

The problem here is a race between concluding the assessment of a PID file as
defunct and unlinking it; during that period, another postmaster may have
replaced the PID file and proceeded. As far as I've been able to figure, this
flaw is fundamental to any PID file invalidation algorithm relying solely on
atomic filesystem operations like unlink(2), link(2), rename(2) and small
write(2) for mutual exclusion. Do any of you see a way to remove the race?

I think we should instead implement postmaster mutual exclusion by way of
fcntl(F_SETLK) on Unix and CreateFile(..., FILE_SHARE_READ, ...) on Windows.
PostgreSQL used fcntl(F_SETLK) on its Unix socket a decade ago, and that usage
was removed[1]http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=792b0f4666b6ea6346aa8d29b568e5d3fe1fcef5 due to portability problems[2]http://archives.postgresql.org/pgsql-hackers/2000-07/msg00359.php[3]http://archives.postgresql.org/pgsql-hackers/2000-11/msg01306.php. POSIX does not require it
to work on other than regular files, but I can't find evidence of a Unix-like
platform not supporting it for regular files. Perl's metaconfig does test it,
with a note about VMS and DOS/DJGPP lacking support. Gnulib reports only the
lack of Windows support, though the Gnulib test suite does not cover F_SETLK.

CreateLockFile() would have this algorithm:

1. open("postmaster.pid", O_RDWR | O_CREAT, 0600)
2. Request a fcntl write lock on the entire postmaster.pid file. If this
fails, abandon startup: another postmaster is running.
3. Read the PID from the file. If the read returns 0 bytes, either we created
the file or the postmaster creating it crashed before writing and syncing any
content. Such a postmaster would not have reached the point of allocating
shared memory or forking children, so we're safe to proceed in either case.
Any other content problems should never happen and can remain fatal errors.
4. Check any old PID as we do today. In theory, this should always be
redundant with the fcntl lock. However, since the PID check code is mature,
let's keep it around indefinitely as a backstop. Perhaps adjust the error
message to better reflect its "can't happen" nature, though.
5. Read the shared memory key from the file. If we find one, probe it as we
do today. Otherwise, the PID file's creating postmaster crashed before
writing and syncing that value. Such a postmaster would not have reached the
point of forking children, so we're safe to proceed.
6. ftruncate() the file and write our own content.
7. Set FD_CLOEXEC on the file, leaving it open to retain the lock.

AddToDataDirLockFile() would use the retained file descriptor. On a related
but independent note, I think the ereport(LOG) calls in that function should
be ereport(FATAL) as they are in CreateLockFile(). We're not significantly
further into the startup process, and failing to add the shared memory key to
the file creates a hazardous situation.

The hazard[4]http://archives.postgresql.org/pgsql-hackers/2012-06/msg01598.php keeping fcntl locking from replacing the PGSharedMemoryIsInUse()
check does not apply here, because the postmaster itself does not run
arbitrary code that might reopen postmaster.pid.

This change adds some interlock to data directories shared over NFS. It also
closes the occasionally-reported[5]http://archives.postgresql.org/pgsql-hackers/2010-08/msg01094.php[6]http://archives.postgresql.org/pgsql-hackers/2012-01/msg00233.php problem that an empty postmaster.pid
blocks cluster startup; we no longer face the possibility that an empty file
is the just-created product of a concurrent postmaster. It's probably no
longer necessary to fsync postmaster.pid, though I'm disinclined to change
that straightway.

Do any of you see holes in that design or have better ideas?

Thanks,
nm

[1]: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=792b0f4666b6ea6346aa8d29b568e5d3fe1fcef5
[2]: http://archives.postgresql.org/pgsql-hackers/2000-07/msg00359.php
[3]: http://archives.postgresql.org/pgsql-hackers/2000-11/msg01306.php
[4]: http://archives.postgresql.org/pgsql-hackers/2012-06/msg01598.php
[5]: http://archives.postgresql.org/pgsql-hackers/2010-08/msg01094.php
[6]: http://archives.postgresql.org/pgsql-hackers/2012-01/msg00233.php

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#1)
Re: CreateLockFile() race condition

Noah Misch <noah@leadboat.com> writes:

The problem here is a race between concluding the assessment of a PID file as
defunct and unlinking it; during that period, another postmaster may have
replaced the PID file and proceeded. As far as I've been able to figure, this
flaw is fundamental to any PID file invalidation algorithm relying solely on
atomic filesystem operations like unlink(2), link(2), rename(2) and small
write(2) for mutual exclusion. Do any of you see a way to remove the race?

Nasty. Still, the issue only exists for two postmasters launched at
just about exactly the same time, which is an unlikely case.

I think we should instead implement postmaster mutual exclusion by way of
fcntl(F_SETLK) on Unix and CreateFile(..., FILE_SHARE_READ, ...) on Windows.

I'm a bit worried about what new problems this solution is going to open
up. It seems not unlikely that the cure is worse than the disease.
Having locking that actually works on (some) NFS setups would be nice,
but ...

The hazard[4] keeping fcntl locking from replacing the PGSharedMemoryIsInUse()
check does not apply here, because the postmaster itself does not run
arbitrary code that might reopen postmaster.pid.

False. See shared_preload_libraries.

regards, tom lane

#3Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#2)
Re: CreateLockFile() race condition

On Fri, Aug 3, 2012 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think we should instead implement postmaster mutual exclusion by way of
fcntl(F_SETLK) on Unix and CreateFile(..., FILE_SHARE_READ, ...) on Windows.

I'm a bit worried about what new problems this solution is going to open
up. It seems not unlikely that the cure is worse than the disease.
Having locking that actually works on (some) NFS setups would be nice,
but ...

The hazard[4] keeping fcntl locking from replacing the PGSharedMemoryIsInUse()
check does not apply here, because the postmaster itself does not run
arbitrary code that might reopen postmaster.pid.

False. See shared_preload_libraries.

It strikes me that it would be sufficient to hold the fcntl() lock
just long enough to establish one of the other interlocks. We don't
really need to hold it for the entire lifetime of the postmaster.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#2)
Re: CreateLockFile() race condition

On Fri, Aug 03, 2012 at 11:59:00AM -0400, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

I think we should instead implement postmaster mutual exclusion by way of
fcntl(F_SETLK) on Unix and CreateFile(..., FILE_SHARE_READ, ...) on Windows.

I'm a bit worried about what new problems this solution is going to open
up. It seems not unlikely that the cure is worse than the disease.

That's a fair concern. There's only so much we'll know in advance.

Having locking that actually works on (some) NFS setups would be nice,
but ...

The hazard[4] keeping fcntl locking from replacing the PGSharedMemoryIsInUse()
check does not apply here, because the postmaster itself does not run
arbitrary code that might reopen postmaster.pid.

False. See shared_preload_libraries.

Quite right. Even so, that code has a special role and narrower goals to
which it can reasonable aspire, giving credibility to ignoring the problem or
documenting the problem away. (I don't see that we document any of the other
constraints on _PG_init of libraries named in shared_preload_libraries.)

Thanks,
nm