file-locking and postmaster.pid

Started by Andreas Joseph Krogh, almost 20 years ago, 39 messages, hackers

andreak@officenet.no

almost 20 years ago

Hi all.

I've experienced several times that PG has died somehow and the postmaster.pid
file still exists 'cause PG hasn't had the ability to delete it upon proper
shutdown. Upon start-up, after such an incident, PG tells me another PG is
running and that I either have to shut down the other instance, or delete the
postmaster.pid file if there really isn't an instance running. This seems
totally unnecessary to me. Why doesn't PG use file-locking to tell if another
PG is running or not? If PG holds an exclusive lock on the pid-file and the
process crashes, or shuts down, then the lock (which is process-based and
controlled by the kernel) will be removed and another PG which tries to start
up can detect that. Using the existence of the pid-file as the only evidence
gives too many false positives IMO.

I'm sure there's a good reason for having it the way it is, having so many
smart knowledgeable people working on this project. Could someone please
explain the rationale of the current solution to me?

--
Andreas Joseph Krogh <andreak@officenet.no>
Senior Software Developer / Manager
gpg public_key: http://dev.officenet.no/~andreak/public_key.asc
------------------------+---------------------------------------------+
OfficeNet AS | The most difficult thing in the world is to |
Hoffsveien 17 | know how to do a thing and to watch |
PO. Box 425 Skøyen | somebody else doing it wrong, without |
0213 Oslo | comment. |
NORWAY | |
Phone : +47 22 13 01 00 | |
Direct: +47 22 13 10 03 | |
Mobile: +47 909 56 963 | |
------------------------+---------------------------------------------+
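
A minimal sketch of the locking scheme proposed in the message above, written in Python for brevity rather than the postmaster's C. It assumes a POSIX system where flock(2) is available and works on the filesystem holding the file; `try_lock_pidfile` is a hypothetical helper name, not anything that exists in PostgreSQL.

```python
import fcntl
import os


def try_lock_pidfile(path):
    """Open (or create) the pidfile and try to take an exclusive lock on it.

    Returns the open file descriptor on success; the lock lives as long as
    that descriptor stays open. Returns None if the lock is already held.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    # We own the lock: record our PID for humans and tools to read.
    os.ftruncate(fd, 0)
    os.write(fd, f"{os.getpid()}\n".encode())
    return fd
```

Under this scheme the mere existence of the file would mean nothing; only a held lock would. The kernel releases the lock when the holder exits for any reason, which is the property the proposal relies on.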

Tom Lane

tgl@sss.pgh.pa.us

almost 20 years ago

In reply to: Andreas Joseph Krogh (#1)

Re: file-locking and postmaster.pid

Andreas Joseph Krogh <andreak@officenet.no> writes:

I've experienced several times that PG has died somehow and the postmaster.pid
file still exists 'cause PG hasn't had the ability to delete it upon proper
shutdown. Upon start-up, after such an incident, PG tells me another PG is
running and that I either have to shut down the other instance, or delete the
postmaster.pid file if there really isn't an instance running. This seems
totally unnecessary to me.

The postmaster does check to see whether the PID mentioned in the file
is still alive, so it's not that easy for the above to happen. If you
can provide details of a scenario where a failure is likely, we'd like
to know about it. Also, what PG version are you talking about?

Why doesn't PG use file-locking to tell if another
PG is running or not?

Portability.

regards, tom lane

Andreas Joseph Krogh

andreak@officenet.no

almost 20 years ago

In reply to: Tom Lane (#2)

Re: file-locking and postmaster.pid

On Tuesday 23 May 2006 17:54, Tom Lane wrote:

Andreas Joseph Krogh <andreak@officenet.no> writes:

I've experienced several times that PG has died somehow and the
postmaster.pid file still exists 'cause PG hasn't had the ability to
delete it upon proper shutdown. Upon start-up, after such an incident,
PG tells me another PG is running and that I either have to shut down the
other instance, or delete the postmaster.pid file if there really isn't
an instance running. This seems totally unnecessary to me.

The postmaster does check to see whether the PID mentioned in the file
is still alive, so it's not that easy for the above to happen. If you
can provide details of a scenario where a failure is likely, we'd like
to know about it. Also, what PG version are you talking about?

I have experienced this with PG-8.1.3 and will provide details if I can make
it happen. Basically it has happened when I have had to "hard-reset" my
laptop due to some strange bugs in Linux which have made it hang.

Why doesn't PG use file-locking to tell if another
PG is running or not?

Portability.

Ok.

Tom Lane

tgl@sss.pgh.pa.us

almost 20 years ago

In reply to: Andreas Joseph Krogh (#3)

Re: file-locking and postmaster.pid

Andreas Joseph Krogh <andreak@officenet.no> writes:

On Tuesday 23 May 2006 17:54, Tom Lane wrote:

The postmaster does check to see whether the PID mentioned in the file
is still alive, so it's not that easy for the above to happen. If you
can provide details of a scenario where a failure is likely, we'd like
to know about it. Also, what PG version are you talking about?

I have experienced this with PG-8.1.3 and will provide details if I can make
it happen. Basically it has happened when I have had to "hard-reset" my
laptop due to some strange bugs in Linux which have made it hang.

If you're talking about a postmaster that's auto-started during the boot
sequence, then there is a risk depending on what start script you use.
The problem is that depending on what else runs during the system
startup, the PID assigned to the postmaster might be the same as in the
last boot cycle, or it might be different by one or two counts. The
postmaster disregards a pidfile containing its own PID, or its parent
process' PID, or a PID not belonging to a postgres-owned process.
That covers most cases but if your start script does something like

su -l postgres -c "pg_ctl start ..."

then you have a situation where not only the parent process (pg_ctl)
but also the grandparent (a shell) is postgres-owned, and if the pidfile
PID happens to match the grandparent then you lose. Solution is to
either not use pg_ctl here, or write "exec pg_ctl start ...", so that
there's only one postgres-owned process besides the postmaster itself.

Initscripts published by PGDG itself and by Red Hat have gotten this
right for a while, but I suspect the word has not propagated to all
distros.

regards, tom lane
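
The rules described in the message above can be sketched roughly as follows. This is an illustration in Python, not the postmaster's actual C code; `pidfile_is_stale` is a hypothetical name, and the sketch assumes that "not postgres-owned" can be approximated by a permission error from a signal-0 probe sent by a non-root user.

```python
import os


def pidfile_is_stale(file_pid, my_pid=None, parent_pid=None):
    """Return True if a PID read from the pidfile can be disregarded."""
    my_pid = os.getpid() if my_pid is None else my_pid
    parent_pid = os.getppid() if parent_pid is None else parent_pid

    if file_pid <= 0:
        return True  # not a usable PID
    if file_pid in (my_pid, parent_pid):
        return True  # PID was recycled to us or to our parent
    try:
        os.kill(file_pid, 0)  # signal 0 only probes; nothing is delivered
    except ProcessLookupError:
        return True  # no such process
    except PermissionError:
        return True  # process exists but belongs to a different user
    return False  # a live process owned by our user: assume a postmaster
```

The grandparent case from the message shows up here as a false "not stale": a live shell owned by the same user whose PID happens to match the file passes every test above, so the start is refused.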

Adis Nezirovic

adis@linux.org.ba

almost 20 years ago

In reply to: Andreas Joseph Krogh (#1)

Re: file-locking and postmaster.pid

On Tue, May 23, 2006 at 05:23:16PM +0200, Andreas Joseph Krogh wrote:

Hi all.

I've experienced several times that PG has died somehow and the postmaster.pid
file still exists 'cause PG hasn't had the ability to delete it upon proper
shutdown. Upon start-up, after such an incident, PG tells me another PG is
running and that I either have to shut down the other instance, or delete the
postmaster.pid file if there really isn't an instance running. This seems
totally unnecessary to me. Why doesn't PG use file-locking to tell if another
PG is running or not? If PG holds an exclusive lock on the pid-file and the
process crashes, or shuts down, then the lock (which is process-based and
controlled by the kernel) will be removed and another PG which tries to start
up can detect that. Using the existence of the pid-file as the only evidence
gives too many false positives IMO.

Well, maybe you could tweak the postgres startup script, add a check for the
postmaster (either 'pgrep postmaster' or 'ps -axu | grep [p]ostmaster'), and
delete the pid file on a negative result.

i.e.

#!/bin/bash
PID=$(pgrep -f /usr/bin/postmaster)

if [[ -n $PID ]]; then
    echo "'$PID'"
    # postgres is already running
else
    echo "Postmaster is not running"
    # delete stale PID file
fi

Tom Lane

tgl@sss.pgh.pa.us

almost 20 years ago

In reply to: Adis Nezirovic (#5)

Re: file-locking and postmaster.pid

Adis Nezirovic <adis@linux.org.ba> writes:

Well, maybe you could tweak the postgres startup script, add a check for the
postmaster (either 'pgrep postmaster' or 'ps -axu | grep [p]ostmaster'), and
delete the pid file on a negative result.

This is exactly what you should NOT do.

A start script that thinks it is smarter than the postmaster is almost
certainly wrong. It is certainly dangerous, too, because auto-deleting
that pidfile destroys the interlock against having two postmasters
running in the same data directory (which WILL corrupt your data,
quickly and irretrievably). All it takes to cause a problem is to
use the start script to start a postmaster, forgetting that you already
have one running ...

regards, tom lane

Adis Nezirovic

adis@linux.org.ba

almost 20 years ago

In reply to: Tom Lane (#6)

Re: file-locking and postmaster.pid

On Tue, May 23, 2006 at 01:36:41PM -0400, Tom Lane wrote:

This is exactly what you should NOT do.

A start script that thinks it is smarter than the postmaster is almost
certainly wrong. It is certainly dangerous, too, because auto-deleting
that pidfile destroys the interlock against having two postmasters
running in the same data directory (which WILL corrupt your data,
quickly and irretrievably). All it takes to cause a problem is to
use the start script to start a postmaster, forgetting that you already
have one running ...

I do agree with you that we should not play games with the postmaster.
Better to be safe than sorry. (So, manually deleting the pid file is the
only safe option.) I was just suggesting a (possibly dangerous)
workaround.

Btw, I do check for a running postmaster using the full path (I don't want to
kill every postmaster on the system). Is this safe? Or could there be a
race condition?

Andreas Joseph Krogh

andreak@officenet.no

almost 20 years ago

In reply to: Tom Lane (#6)

Re: file-locking and postmaster.pid

On Tuesday 23 May 2006 19:36, Tom Lane wrote:

Adis Nezirovic <adis@linux.org.ba> writes:

Well, maybe you could tweak the postgres startup script, add a check for the
postmaster (either 'pgrep postmaster' or 'ps -axu | grep [p]ostmaster'), and
delete the pid file on a negative result.

This is exactly what you should NOT do.

A start script that thinks it is smarter than the postmaster is almost
certainly wrong. It is certainly dangerous, too, because auto-deleting
that pidfile destroys the interlock against having two postmasters
running in the same data directory (which WILL corrupt your data,
quickly and irretrievably). All it takes to cause a problem is to
use the start script to start a postmaster, forgetting that you already
have one running ...

My PG is not started with startup-scripts, but with this command:

pg_ctl -D $PGDATA -l $PGDIR/log/logfile-`date +%Y-%m-%d`.log start

Andreas Joseph Krogh

andreak@officenet.no

almost 20 years ago

In reply to: Andreas Joseph Krogh (#8)

Re: file-locking and postmaster.pid

On Wednesday 24 May 2006 11:36, Andreas Joseph Krogh wrote:

On Tuesday 23 May 2006 19:36, Tom Lane wrote:

Adis Nezirovic <adis@linux.org.ba> writes:

Well, maybe you could tweak the postgres startup script, add a check for the
postmaster (either 'pgrep postmaster' or 'ps -axu | grep [p]ostmaster'),
and delete the pid file on a negative result.

This is exactly what you should NOT do.

A start script that thinks it is smarter than the postmaster is almost
certainly wrong. It is certainly dangerous, too, because auto-deleting
that pidfile destroys the interlock against having two postmasters
running in the same data directory (which WILL corrupt your data,
quickly and irretrievably). All it takes to cause a problem is to
use the start script to start a postmaster, forgetting that you already
have one running ...

My PG is not started with startup-scripts, but with this command:

pg_ctl -D $PGDATA -l $PGDIR/log/logfile-`date +%Y-%m-%d`.log start

... and manually after login, ie. not at boot-time.

#10

Andrej Ricnik-Bay

andrej.groups@gmail.com

almost 20 years ago

In reply to: Andreas Joseph Krogh (#9)

Re: file-locking and postmaster.pid

On 5/24/06, Andreas Joseph Krogh <andreak@officenet.no> wrote:

My PG is not started with startup-scripts, but with this command:

pg_ctl -D $PGDATA -l $PGDIR/log/logfile-`date +%Y-%m-%d`.log start

... and manually after login, ie. not at boot-time.

I'd suggest trying to fix your Linux install instead of mucking
about with Postgres, and this is really a pgsql-novice question,
not a -hackers thing.

Cheers,
Andrej

--
Please don't top post, and don't use HTML e-Mail :} Make your quotes concise.

http://www.american.edu/econ/notes/htmlmail.htm

#11

korry

korry@appx.com

almost 20 years ago

In reply to: Andreas Joseph Krogh (#1)

Re: file-locking and postmaster.pid

I'm sure there's a good reason for having it the way it is, having so many
smart knowledgeable people working on this project. Could someone please
explain the rationale of the current solution to me?

We've ignored Andreas' original question. Why not use a lock to
indicate that the postmaster is still running? At first blush, that
seems more reliable than checking for a (possibly recycled) process ID.

-- Korry

#12

Andreas Joseph Krogh

andreak@officenet.no

almost 20 years ago

In reply to: korry (#11)

Re: file-locking and postmaster.pid

On Wednesday 24 May 2006 21:03, korry wrote:

I'm sure there's a good reason for having it the way it is, having so
many smart knowledgeable people working on this project. Could someone
please explain the rationale of the current solution to me?

We've ignored Andreas' original question. Why not use a lock to
indicate that the postmaster is still running? At first blush, that
seems more reliable than checking for a (possibly recycled) process ID.

As Tom replied: Portability.

#13

Andreas Joseph Krogh

andreak@officenet.no

almost 20 years ago

In reply to: Andrej Ricnik-Bay (#10)

Re: file-locking and postmaster.pid

On Wednesday 24 May 2006 20:52, Andrej Ricnik-Bay wrote:

On 5/24/06, Andreas Joseph Krogh <andreak@officenet.no> wrote:

My PG is not started with startup-scripts, but with this command:

pg_ctl -D $PGDATA -l $PGDIR/log/logfile-`date +%Y-%m-%d`.log start

... and manually after login, ie. not at boot-time.

I'd suggest trying to fix your Linux install instead of mucking
about with Postgres, and this is really a pgsql-novice question,
not a -hackers thing.

I'm sorry, I can't resist, but this has to be *the* dumbest reply to this sort
of question. What makes you think it *only* happens when Linux freezes? (Btw,
I suspect my NVIDIA driver to be the problem on my laptop, not Linux itself.)
Still, PG *should* handle that situation too; it's like a power outage. I've
been using Linux exclusively since '96 and PG since 6.5, so I don't consider
myself a novice in either. Why PG doesn't use locking *is* definitely
a -hackers thing.

#14

korry

korry@appx.com

almost 20 years ago

In reply to: Andreas Joseph Krogh (#12)

Re: file-locking and postmaster.pid

On Wednesday 24 May 2006 21:03, korry wrote:

I'm sure there's a good reason for having it the way it is, having so
many smart knowledgeable people working on this project. Could someone
please explain the rationale of the current solution to me?

We've ignored Andreas' original question. Why not use a lock to
indicate that the postmaster is still running? At first blush, that
seems more reliable than checking for a (possibly recycled) process ID.

As Tom replied: Portability.

Thanks - I missed that part of Tom's message.

The only platform (although certainly not a minor issue) that I can
think of that would have a portability issue would be Win32. You can't
even read a locked byte in Win32. I usually solve that problem by
locking a byte past the end of the file (which is portable).

Is there some other portability issue that I'm missing?

-- Korry

#15

Alvaro Herrera

alvherre@2ndquadrant.com

almost 20 years ago

In reply to: korry (#14)

Re: file-locking and postmaster.pid

korry wrote:

The only platform (although certainly not a minor issue) that I can
think of that would have a portability issue would be Win32. You can't
even read a locked byte in Win32. I usually solve that problem by
locking a byte past the end of the file (which is portable).

Certainly on all platforms there must be *some* locking primitive. We
just need to figure out the appropriate parameters to fcntl() or flock()
or lockf() on each.

The Win32 API for locking seems mighty strange to me.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#16

korry

korry@appx.com

almost 20 years ago

In reply to: Alvaro Herrera (#15)

Re: file-locking and postmaster.pid

Certainly on all platforms there must be *some* locking primitive. We
just need to figure out the appropriate parameters to fcntl() or flock()
or lockf() on each.

Right.

The Win32 API for locking seems mighty strange to me.

Linux/Unix byte locking is advisory (meaning that one lock can block
another lock, but it can't block a read). Win32 locking is mandatory
(at least in the most portable form) so a lock blocks a reader. To
avoid that problem, you lock a byte that you never intend to read (that
is, you lock a byte past the end of the file). Locking past the
end-of-file is portable to all Unix/Linux systems that I've seen (that
way, you can lock a region of a file before you grow the file).

-- Korry
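
The "lock a byte past the end of the file" technique described above might look like this on a POSIX system. It is a sketch under assumptions: `LOCK_OFFSET` and `lock_past_eof` are hypothetical names, Python's `fcntl.lockf` (a wrapper around fcntl(2) record locks) stands in for whatever the C code would call, and the Win32 side of the argument is not covered here.

```python
import fcntl
import os

# Hypothetical offset, chosen only to be far beyond any real pidfile size.
LOCK_OFFSET = 1 << 30


def lock_past_eof(fd):
    """Try to take an exclusive one-byte lock beyond end-of-file.

    The file is not grown and its contents stay readable by everyone,
    because the locked byte is never part of the data. Returns True if
    the lock was acquired, False if another process already holds it.
    """
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1,
                    LOCK_OFFSET, os.SEEK_SET)
    except (BlockingIOError, PermissionError):  # EAGAIN or EACCES
        return False
    return True
```

Note that fcntl(2) record locks belong to the process, not to the descriptor, so a second attempt from the same process would succeed; the conflict only appears between different processes.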

#17

Andrew Dunstan

andrew@dunslane.net

almost 20 years ago

In reply to: Alvaro Herrera (#15)

Re: file-locking and postmaster.pid

Alvaro Herrera wrote:

korry wrote:

The only platform (although certainly not a minor issue) that I can
think of that would have a portability issue would be Win32. You can't
even read a locked byte in Win32. I usually solve that problem by
locking a byte past the end of the file (which is portable).

Certainly on all platforms there must be *some* locking primitive. We
just need to figure out the appropriate parameters to fcntl() or flock()
or lockf() on each.

The Win32 API for locking seems mighty strange to me.

We use file locking on Win32 (and on all other platforms) in the
buildfarm ... it's done from perl so maybe perl does some magic under
the hood. The call looks just the same, and works fine on W32, I
believe. It is roughly:

use Fcntl qw(:flock);
open($lockfile,">builder.LCK") || die "opening lockfile";
exit(0) unless flock($lockfile,LOCK_EX|LOCK_NB);

cheers

andrew

#18

Alvaro Herrera

alvherre@2ndquadrant.com

almost 20 years ago

In reply to: korry (#16)

Re: file-locking and postmaster.pid

korry wrote:

The Win32 API for locking seems mighty strange to me.

Linux/Unix byte locking is advisory (meaning that one lock can block
another lock, but it can't block a read).

No -- it is advisory meaning that a process that does not try to acquire
the lock is not locked out. You can certainly block a file in exclusive
mode, using the LOCK_EX flag. (And at least on my Linux system, there
is mandatory locking too, using the fcntl() interface).

I think the next question is -- how would the lock interface be used?
We could acquire an exclusive lock on postmaster start (to make sure no
backend is running), then reduce it to a shared lock. Every backend
would inherit the shared lock. But the lock exchange is not guaranteed
to be atomic so a new postmaster could start just after we acquire the
lock and acquire the shared lock. It'd need to be complemented with
another lock.

Win32 locking is mandatory (at least in the most portable form) so a
lock blocks a reader.

There is also shared/exclusive locking of a file on Win32. My comment
was more directed at the fact that you have to "create some sort of
lock handle" from a file handle and then lock the lock handle, or
something like that. I don't recall the exact details but it was
strange (as opposed to just open and then flock).

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
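
The exclusive-then-shared sequence discussed in the message above could be sketched like this with flock(2), again in Python and assuming a POSIX system; `start_with_downgrade` is a hypothetical name. The flock(2) manual page states that converting a lock is not guaranteed to be atomic, which is the window the message worries about.

```python
import fcntl
import os


def start_with_downgrade(path):
    """Take an exclusive lock at start-up, then convert it to a shared one.

    Returns the open descriptor holding the shared lock, or None if
    another holder (exclusive or shared) is present.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Succeeds only if nobody holds any lock on the file at all.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        # ... start-up checks would run here ...
        # Conversion: the old lock may be dropped before the new one is
        # set, so another starter could slip in at this point.
        fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

Children created by fork() share the same open file description, so the shared lock stays in place until the last copy of the descriptor is closed. The sketch does not close the conversion window; as the message says, a second lock would be needed for that.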

#19

Tom Lane

tgl@sss.pgh.pa.us

almost 20 years ago

In reply to: Alvaro Herrera (#15)

Re: file-locking and postmaster.pid

Alvaro Herrera <alvherre@commandprompt.com> writes:

Certainly on all platforms there must be *some* locking primitive. We
just need to figure out the appropriate parameters to fcntl() or flock()
or lockf() on each.

Quite aside from the hassle factor of needing to deal with N variants of
the syscalls, I'm not convinced that it's guaranteed to work. ISTR that
for instance NFS file locking is pretty much Alice-in-Wonderland :-(

Since the entire point here is to have a guaranteed bulletproof check,
locks that work most of the time on most platforms/filesystems aren't
gonna be an improvement.

regards, tom lane

#20

Alvaro Herrera

alvherre@2ndquadrant.com

almost 20 years ago

In reply to: Andrew Dunstan (#17)

Re: file-locking and postmaster.pid

Andrew Dunstan wrote:

We use file locking on Win32 (and on all other platforms) in the
buildfarm ... it's done from perl so maybe perl does some magic under
the hood. The call looks just the same, and works fine on W32, I
believe. It is roughly:

use Fcntl qw(:flock);
open($lockfile,">builder.LCK") || die "opening lockfile";
exit(0) unless flock($lockfile,LOCK_EX|LOCK_NB);

flock on Perl is implemented using platform-dependent system calls. Per
the docs,

flock FILEHANDLE,OPERATION
Calls flock(2), or an emulation of it, on FILEHANDLE. Returns
true for success, false on failure. Produces a fatal error if
used on a machine that doesn't implement flock(2), fcntl(2)
locking, or lockf(3). "flock" is Perl's portable file locking
interface, although it locks only entire files, not records.

Note that it may fail! This seems to indicate that some platforms do
not provide any of these locking mechanisms.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#21

Alvaro Herrera

alvherre@2ndquadrant.com

almost 20 years ago

In reply to: Alvaro Herrera (#20)

#22

Andrew Dunstan

andrew@dunslane.net

almost 20 years ago

In reply to: Alvaro Herrera (#21)