"Bogus data in lock file" shouldn't be FATAL?

Started by Tom Laneover 15 years ago7 messageshackers
Jump to latest
#1Tom Lane
tgl@sss.pgh.pa.us

This complaint:
http://archives.postgresql.org/pgsql-admin/2010-08/msg00111.php

seems to suggest that this code in CreateLockFile() is not well-thought-out:

if (other_pid <= 0)
elog(FATAL, "bogus data in lock file \"%s\": \"%s\"",
filename, buffer);

as it means that a corrupted (empty, in this case) postmaster.pid file
prevents the server from starting until somebody intervenes manually.

I think that the original concern was that if we couldn't read valid
data out of postmaster.pid then we couldn't be sure if there were a
conflicting postmaster running. But if that's the plan then
CreateLockFile is violating it further down, where it happily skips the
PGSharedMemoryIsInUse check if it fails to pull shmem ID numbers from
the file.

We could perhaps address that risk another way: after having written
postmaster.pid, try to read it back to verify that it contains what we
wrote, and abort if not. Then, if we can't read it during startup,
it's okay to assume there is no conflicting postmaster.

Or, given the infrequency of complaints, maybe it's better not to touch
this. Thoughts?

regards, tom lane

#2Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#1)
Re: "Bogus data in lock file" shouldn't be FATAL?

On Mon, Aug 16, 2010 at 11:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

This complaint:
http://archives.postgresql.org/pgsql-admin/2010-08/msg00111.php

seems to suggest that this code in CreateLockFile() is not well-thought-out:

               if (other_pid <= 0)
                       elog(FATAL, "bogus data in lock file \"%s\": \"%s\"",
                                filename, buffer);

as it means that a corrupted (empty, in this case) postmaster.pid file
prevents the server from starting until somebody intervenes manually.

I think that the original concern was that if we couldn't read valid
data out of postmaster.pid then we couldn't be sure if there were a
conflicting postmaster running.  But if that's the plan then
CreateLockFile is violating it further down, where it happily skips the
PGSharedMemoryIsInUse check if it fails to pull shmem ID numbers from
the file.

We could perhaps address that risk another way: after having written
postmaster.pid, try to read it back to verify that it contains what we
wrote, and abort if not.  Then, if we can't read it during startup,
it's okay to assume there is no conflicting postmaster.

What if it was readable when written but has since become unreadable?

My basic feeling on this is that manual intervention to start the
server is really undesirable and we should try hard to avoid needing
it. That having been said, accidentally starting two postmasters at
the same time that are accessing the same data files would be several
orders of magnitude worse. We can't afford to compromise on any
interlock mechanisms that are necessary to prevent that from
happening.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#2)
Re: "Bogus data in lock file" shouldn't be FATAL?

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Aug 16, 2010 at 11:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

We could perhaps address that risk another way: after having written
postmaster.pid, try to read it back to verify that it contains what we
wrote, and abort if not. �Then, if we can't read it during startup,
it's okay to assume there is no conflicting postmaster.

What if it was readable when written but has since become unreadable?

Yup, that's the weak spot in any such assumption. One might also draw
an analogy to the case of failing to open postmaster.pid because of
permissions change, which seems at least as likely as a data change.
And we consider that as fatal, for good reason I think.

My basic feeling on this is that manual intervention to start the
server is really undesirable and we should try hard to avoid needing
it. That having been said, accidentally starting two postmasters at
the same time that are accessing the same data files would be several
orders of magnitude worse. We can't afford to compromise on any
interlock mechanisms that are necessary to prevent that from
happening.

Yeah. At the same time, it's really really bad to encourage people to
remove postmaster.pid manually as the first attempt to fix anything.
That completely destroys whatever interlock you thought you had. So
it's not too hard to make a case that avoiding this scenario will really
make things safer not less so.

The bottom line here is that it's not clear to me whether changing this
would be a net reliability improvement or not. Maybe better to leave
it alone.

regards, tom lane

#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#1)
Re: "Bogus data in lock file" shouldn't be FATAL?

Excerpts from Tom Lane's message of lun ago 16 11:18:32 -0400 2010:

We could perhaps address that risk another way: after having written
postmaster.pid, try to read it back to verify that it contains what we
wrote, and abort if not. Then, if we can't read it during startup,
it's okay to assume there is no conflicting postmaster.

Makes sense.

BTW some past evidence:
http://archives.postgresql.org/message-id/e3e180dc0905070719q58136caai23fbb777fd3c0df7@mail.gmail.com

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#5Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#3)
Re: "Bogus data in lock file" shouldn't be FATAL?

Excerpts from Tom Lane's message of lun ago 16 11:49:51 -0400 2010:

The bottom line here is that it's not clear to me whether changing this
would be a net reliability improvement or not. Maybe better to leave
it alone.

In that case, maybe consider fsync'ing it.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#5)
Re: "Bogus data in lock file" shouldn't be FATAL?

Alvaro Herrera <alvherre@commandprompt.com> writes:

Excerpts from Tom Lane's message of lun ago 16 11:49:51 -0400 2010:

The bottom line here is that it's not clear to me whether changing this
would be a net reliability improvement or not. Maybe better to leave
it alone.

In that case, maybe consider fsync'ing it.

Hrm ... I had supposed we did fsync lockfiles, but a look at the code
shows not. This seems like a clear oversight. I think we should not
only add that, but back-patch it. It seems entirely plausible that the
lack of an fsync is exactly what led to the original complaint.

regards, tom lane

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#5)
Re: "Bogus data in lock file" shouldn't be FATAL?

Alvaro Herrera <alvherre@commandprompt.com> writes:

In that case, maybe consider fsync'ing it.

BTW, I see you already proposed that in the thread at
http://archives.postgresql.org/message-id/e3e180dc0905070719q58136caai23fbb777fd3c0df7@mail.gmail.com
I'm not sure how come the idea fell through the cracks, but we should
surely have done it then. I think I was reading the initial complaint
as being just logfile corruption, and failed to make the connection to
the postmaster.pid file.

regards, tom lane