Better detection of staled postmaster.pid

Started by Pavel Raiskup over 10 years ago, 4 messages
#1 Pavel Raiskup
praiskup@redhat.com

This is most likely just a request for brainstorming.

It's been reported [1] that postmaster fails to start against a stale
postmaster.pid after (e.g.) a power outage on Fedora. This is due to init
system parallelism: some other newly started process can already have been
allocated the same PID the old postmaster had, and in that case postmaster
refuses to delete the stale pidfile (which is expected, as we need to be
really careful).

Can you see some other check we could implement to guarantee that the PID
mentioned in postmaster.pid does not belong to a concurrent postmaster?
I can think of parsing /proc/<CONCURRENT_PID>/cmdline for an occurrence of the
'-D' option, but that is not terribly portable and could be considered racy.
Or is there some acceptable hack we could use to tell other processes that we
are running a particular data directory?
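
To make the /proc idea concrete, here is a minimal sketch in Python. It is an illustration only, not proposed postmaster code: the function names are hypothetical, it assumes a Linux-style /proc, and it only recognises the "-D dir" and "-Ddir" spellings. It cannot see a data directory passed through the PGDATA environment variable, and the answer can be out of date by the time it is returned, which is the raciness mentioned above.

```python
import os
from typing import Optional


def data_dir_from_cmdline(raw: bytes) -> Optional[str]:
    """Return the argument of the first -D option in a /proc/<pid>/cmdline blob."""
    # argv entries are NUL-separated and the blob normally ends with a NUL.
    args = [a.decode(errors="replace") for a in raw.rstrip(b"\0").split(b"\0")]
    for i, arg in enumerate(args[1:], start=1):
        if arg == "-D":
            # Separate form: the directory is the next argv entry, if any.
            return args[i + 1] if i + 1 < len(args) else None
        if arg.startswith("-D") and len(arg) > 2:
            # Attached form: -D/path/to/data
            return arg[2:]
    return None


def pid_uses_data_dir(pid: int, data_dir: str) -> Optional[bool]:
    """True/False if the process names a data directory, None if unknown."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            raw = f.read()
    except OSError:
        return None  # no such process, no permission, or no /proc at all
    found = data_dir_from_cmdline(raw)
    if found is None or not os.path.isabs(found):
        # A relative path is relative to the other process's cwd, which
        # this sketch does not try to resolve.
        return None
    return os.path.realpath(found) == os.path.realpath(data_dir)
```

Note that an unrelated program can also take a -D flag with a different meaning (sshd, for example), so a match on the option alone proves little; comparing the directory narrows it down but does not remove the race.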

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1257334

Pavel

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Kevin Grittner
kgrittn@ymail.com
In reply to: Pavel Raiskup (#1)
Re: Better detection of staled postmaster.pid

Pavel Raiskup <praiskup@redhat.com> wrote:

> It's been reported [1] that postmaster fails to start against a stale
> postmaster.pid after (e.g.) a power outage on Fedora. This is due to init
> system parallelism: some other newly started process can already have been
> allocated the same PID the old postmaster had, and in that case postmaster
> refuses to delete the stale pidfile (which is expected, as we need to be
> really careful).
>
> Can you see some other check we could implement to guarantee that the PID
> mentioned in postmaster.pid does not belong to a concurrent postmaster?

Was the other newly started process another PostgreSQL cluster?
Was it launched under the same OS user? (Those are the only
conditions under which I've seen this.) I think it is wise to use
a separate OS user for each cluster.

If it's not a matter of multiple clusters running under the same OS
user, please provide more details, like the specific version and
copy/paste of error messages and relevant log entries.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#3 David G. Johnston
david.g.johnston@gmail.com
In reply to: Kevin Grittner (#2)
Re: Better detection of staled postmaster.pid

On Mon, Aug 31, 2015 at 10:20 AM, Kevin Grittner <kgrittn@ymail.com> wrote:

> Pavel Raiskup <praiskup@redhat.com> wrote:
>
>> It's been reported [1] that postmaster fails to start against a stale
>> postmaster.pid after (e.g.) a power outage on Fedora. This is due to init
>> system parallelism: some other newly started process can already have been
>> allocated the same PID the old postmaster had, and in that case postmaster
>> refuses to delete the stale pidfile (which is expected, as we need to be
>> really careful).
>>
>> Can you see some other check we could implement to guarantee that the PID
>> mentioned in postmaster.pid does not belong to a concurrent postmaster?

Most of this can be gleaned from the linked bug report...

> Was the other newly started process another PostgreSQL cluster?

Yes

> Was it launched under the same OS user? (Those are the only
> conditions under which I've seen this.) I think it is wise to use
> a separate OS user for each cluster.

Yes. Does the PID check verify that the owner of the pid file matches the
owner of the process? While this is seemingly good advice, I'm not sure how
it would prevent this scenario, likely due to a lack of knowledge on my part.
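
One way a separate OS user per cluster could help is through a signal-0 liveness probe. The sketch below is a hedged illustration in Python, not PostgreSQL's actual code: the function name is hypothetical, and it assumes the lock-file check treats a PID it is not permitted to signal as belonging to someone else, and therefore not as a conflicting postmaster for this cluster.

```python
import os
from typing import Callable


def pid_may_hold_lock(pid: int,
                      kill: Callable[[int, int], None] = os.kill) -> bool:
    """Return True only if `pid` is alive AND signalable by the current user."""
    try:
        kill(pid, 0)  # signal 0 performs the existence/permission check only
    except ProcessLookupError:
        return False  # ESRCH: no such process, so the pidfile is stale
    except PermissionError:
        # EPERM: the PID exists but belongs to another user. Under the
        # assumption stated above, it cannot be this cluster's postmaster.
        return False
    return True  # same-user live process: stale vs. live cannot be told apart
```

Under that assumption, a recycled PID owned by a different user is ruled out immediately, while a recycled PID owned by the same user (for instance a second cluster) still looks like a possible live postmaster.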

> If it's not a matter of multiple clusters running under the same OS
> user, please provide more details, like the specific version and
> copy/paste of error messages and relevant log entries

See report. I get not having transient data linked to in these kinds of
postings, but the supplied description and the official downstream project
bug report seem like sufficient data to work from, even if only in a
preliminary fashion.

The only obvious solution is to stop using (pid) as a primary key of sorts
and use (pid, timecreated) instead. After a restart/reboot the timecreated
would be guaranteed to have changed and no guessing would be involved.
That seems invasive, though proper, for a problem largely limited to an
uncommon distribution-specific setup that requires an unclean shutdown to
occur.
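
A rough sketch of the (pid, timecreated) idea, in Python and assuming a Linux-style /proc. The names LockRecord, parse_start_ticks, read_start_ticks and is_stale are hypothetical. The start time used here is field 22 of /proc/<pid>/stat, which counts clock ticks since boot, so a real implementation would probably want a wall-clock timestamp or a boot identifier alongside it to make values from different boots distinguishable.

```python
from typing import Callable, NamedTuple, Optional


class LockRecord(NamedTuple):
    pid: int
    start_ticks: int  # process start time, in clock ticks since boot


def parse_start_ticks(stat_line: str) -> int:
    """Extract field 22 (starttime) from the contents of /proc/<pid>/stat."""
    # Field 2 (comm) is parenthesised and may contain spaces or ')', so
    # split after the LAST ')'. The remainder begins at field 3 (state).
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    return int(rest[19])  # field 22 is index 22 - 3 = 19 in the remainder


def read_start_ticks(pid: int) -> Optional[int]:
    """Start time of a live process, or None if it cannot be determined."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            return parse_start_ticks(f.read())
    except (OSError, ValueError, IndexError):
        return None


def is_stale(record: LockRecord,
             lookup: Callable[[int], Optional[int]] = read_start_ticks) -> bool:
    """A record is stale if the PID is gone or now names a different process."""
    current = lookup(record.pid)
    return current is None or current != record.start_ticks
```

With such a record, a recycled PID is only mistaken for the old postmaster if the new process also reports the same start time.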

David J.

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#2)
Re: Better detection of staled postmaster.pid

Kevin Grittner <kgrittn@ymail.com> writes:

> Pavel Raiskup <praiskup@redhat.com> wrote:
>
>> It's been reported [1] that postmaster fails to start against a stale
>> postmaster.pid after (e.g.) a power outage on Fedora,

> Was the other newly started process another PostgreSQL cluster?
> Was it launched under the same OS user?

Yes, that's what the bugzilla report indicated.

> (Those are the only
> conditions under which I've seen this.) I think it is wise to use
> a separate OS user for each cluster.

That's my recommendation too. The only other thing you could do to
prevent this would be to manually blow away postmaster.pid files,
and please listen to this: that cure is a lot worse than the disease.
It's almost never implemented safely (that is, in a way that guarantees
the forcible removal can *only* happen at system boot and never later).

regards, tom lane
