FATAL: lock file "postmaster.pid" already exists
Hi,
On Windows 2008, the server sometimes fails to start because of an existing
"postmaster.pid" file.
I tried rebooting a few times and even force shutting down the server, and
it started up fine.
It seems to be a race condition of sorts in the code that detects whether
the process with the PID in the file is running or not.
Does anyone else have this problem? Is there any way to fix it besides
removing the PID file manually each time the server complains about this?
Thanks,
Deepak
On 8 May 2012, at 24:34, deepak wrote:
Hi,
On Windows 2008, sometimes the server fails to start due to an existing "postmaster.pid" file.
I tried rebooting a few times and even force shutting down the server, and it started up fine.
It seems to be a race-condition of sorts in the code that detects whether the process with PID
in the file is running or not.
No, it means that postgres wasn't shut down properly when Windows shut down. Removing the pid-file is one of the last things the shut-down procedure does. The file is used to prevent 2 instances of the same server running on the same data-directory.
If it's a race-condition, it's probably one in Microsoft's shutdown code. I've seen similar problems with Outlook mailboxes on a network directory; Windows unmounts the remote file-systems before Outlook has finished updating its files under that mount point, so Outlook throws an error message and Windows doesn't shut down because of that.
I don't suppose that pid-file is on a remote file-system?
Does anyone have this same problem? Any way to fix it besides removing the PID file
manually each time the server complains about this?
You could probably script removal of the pid file if its creation date is before the time the system started booting up.
Alban Hertroys
--
The scale of a problem often equals the size of an ego.
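A minimal sketch of the scripted removal suggested above, in Python. The function names are hypothetical, the caller has to supply the boot time (the standard library has no portable call for it), and the file's modification time is used as a stand-in for its creation date. Since the pid file exists to prevent two instances from running on the same data directory, a script like this is only safe when no postgres process is still running against that directory.

```python
import os

def is_stale_pid_file(pid_path, boot_time):
    """Return True if pid_path exists and was last modified before boot_time.

    boot_time is in seconds since the epoch and must be supplied by the
    caller; how to obtain it is platform-specific and not shown here.
    """
    try:
        return os.stat(pid_path).st_mtime < boot_time
    except FileNotFoundError:
        # No pid file at all: nothing is stale.
        return False

def remove_if_stale(pid_path, boot_time):
    """Delete the pid file only if it predates the last boot.

    Returns True if the file was removed, False otherwise.
    """
    if is_stale_pid_file(pid_path, boot_time):
        os.remove(pid_path)
        return True
    return False
```

A pid file written after the machine booted is left alone, so a server that is starting up normally is not affected.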
On Tue, May 8, 2012 at 3:09 AM, Alban Hertroys <haramrae@gmail.com> wrote:
On 8 May 2012, at 24:34, deepak wrote:
Hi,
On Windows 2008, sometimes the server fails to start due to an existing
"postmaster.pid" file.
I tried rebooting a few times and even force shutting down the server,
and it started up fine.
It seems to be a race-condition of sorts in the code that detects
whether the process with PID
in the file is running or not.
No, it means that postgres wasn't shut down properly when Windows shut
down. Removing the pid-file is one of the last things the shut-down
procedure does. The file is used to prevent 2 instances of the same server
running on the same data-directory.
If it's a race-condition, it's probably one in Microsoft's shutdown code.
I've seen similar problems with Outlook mailboxes on a network directory;
Windows unmounts the remote file-systems before Outlook has finished updating
its files under that mount point, so Outlook throws an error message and
Windows doesn't shut down because of that.
I don't suppose that pid-file is on a remote file-system?
No, it's local.
Does anyone have this same problem? Any way to fix it besides removing
the PID file
manually each time the server complains about this?
You could probably script removal of the pid file if its creation date is
before the time the system started booting up.
Thanks. It looks like the code already overwrites an old pid file
if no other process is using it (if I understand the code correctly, it
just echoes a byte onto a pipe to detect this).
Still, I can't see under what conditions this occurs. I have seen it
happen a couple of times, but I don't know how to reproduce the problem
predictably.
--
Deepak
Hi!
We could reproduce the start-up problem on Windows 2003. After a reboot,
the postmaster cleans up old temporary files as part of its start-up
sequence, and this step took several minutes (a little over 4 minutes),
delaying the writing of line 6 onwards into the PID file. This delay caused
pg_ctl to time out, leaving behind an orphaned postgres.exe process (which
eventually forks off many other postgres.exe processes). But since pg_ctl
itself isn't running after the timeout, Windows thinks the service isn't
running. A subsequent attempt to start the service using pg_ctl then
complains that the existing lock file is still in use by one of the
postgres.exe processes that was spawned before.
We have observed conclusively that the file system cache comes into play.
We tested a scenario in which a reboot was followed by walking the
file system under the data directory with the Cygwin "find" command. After
that, pg_ctl did not time out and the server started up fine,
suggesting that the cleanup is much faster when the file system is cached.
Any ideas on fixing this start-up delay in the postmaster?
Could the cleanup move elsewhere, specifically to somewhere after
the PID file is completely written, so that pg_ctl doesn't time out?
Any other suggestions for working around this problem?
Thanks,
Deepak
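The timeout described in the message above can be illustrated with a small polling loop. This is only a sketch under stated assumptions, not pg_ctl's actual logic: the function name, the polling interval and the "at least N lines" readiness test are all hypothetical stand-ins for "line 6 onwards has been written".

```python
import time

def wait_for_pid_file(path, min_lines, timeout, poll_interval=0.1):
    """Poll path until it holds at least min_lines lines or timeout expires.

    Returns True if the file became complete in time, False if the caller
    gave up. Giving up does not stop whatever process is still writing
    the file; it only means the waiter is no longer watching.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path, "r") as f:
                if len(f.read().splitlines()) >= min_lines:
                    return True
        except FileNotFoundError:
            pass  # not created yet, keep waiting
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

If the start-up work that precedes the final write takes longer than the timeout, the waiter returns False while the server process keeps running, which matches the orphaned postgres.exe described above.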
deepak <deepak.pn@gmail.com> writes:
We could reproduce the start-up problem on Windows 2003. After a reboot,
postmaster, in its start-up sequence cleans up old temporary files, and
this step used to take several minutes (a little over 4 minutes), delaying
the writing of line 6 onwards into the PID file. This delay caused pg_ctl
to timeout, leaving behind an orphaned postgres.exe process (which
eventually forks off many other postgres.exe processes).
Hmm. It's easy enough to postpone temp file cleanup till after the
postmaster's PID file is completely written, so I've committed a patch
for that. However, I find it mildly astonishing that such cleanup could
take multiple minutes. What are you using for storage, a man with an
abacus?
regards, tom lane
Thanks. I have asked one of the other developers working on this issue to
comment.
--
Deepak
I tried moving the call to RemovePgTempFiles to
after the PID file is fully written, but it did not help.
pg_ctl attempts to connect to the database, and does
not report the database as running until that connection
succeeds. I am not comfortable moving the call to
RemovePgTempFiles after the point in the postmaster
where child processes are spawned and connections
made available to clients because by that point the
temporary files encountered may be valid ones from
the current incarnation of Postgres and not from the
incarnation before the reboot.
I do not know precisely why the filesystem is so slow,
except to say that we have many relations:
xyzzy=# select count(*) from pg_catalog.pg_class;
 count
-------
 27340
(1 row)
xyzzy=# select count(*) from pg_catalog.pg_attribute;
 count
--------
 236252
(1 row)
Running `find . | wc -l` on the data directory gives
55219
________________________________
From: deepak <deepak.pn@gmail.com>
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Alban Hertroys <haramrae@gmail.com>; pgsql-general@postgresql.org; markdilger@yahoo.com
Sent: Wednesday, May 23, 2012 9:03 AM
Subject: Re: [GENERAL] FATAL: lock file "postmaster.pid" already exists
Thanks, I have put one of the other developers working on this issue, to comment.
--
Deepak
Mark Dilger <markdilger@yahoo.com> writes:
I tried moving the call to RemovePgTempFiles to
after the PID file is fully written, but it did not help.
I wonder whether you correctly identified the source of the slowness.
The thing I would have suspected is identify_system_timezone(), which
will attempt to read every file in the timezone-database directory tree,
of which there are about 600. It's not unusual for that to take several
seconds on a cold-started machine that doesn't have any of that tree in
filesystem cache. It's still a stretch to believe that it'd take
several minutes on any storage system more advanced than a floppy disk;
but at least we'd only be trying to pin about one order of magnitude
slowdown on the filesystem, rather than several orders.
If that is what is causing it, there is a very simple workaround, which
is to set the timezone setting explicitly in postgresql.conf instead of
leaving the postmaster to try to figure it out from the environment.
(9.2 will use a better answer, which is for initdb to do this once and
store the result in postgresql.conf.)
regards, tom lane
Prior to posting to the mailing list, we made some
changes in postmaster.c to identify where time was
being spent. Based on the elog(NOTICE,...) lines
we put in the file, we determined the time was spent
inside RemovePgTempFiles.
I then altered RemovePgTempFiles to take a starttime
parameter and, while recursing, to check if more than
5 seconds has passed since it started. I did not want
to add the complexity of setting an alarm and catching
the signal, so I just made the code check the wallclock
time at each step of the recursion. When more than
5 seconds has passed, it does not recurse further.
After making this change, we have not been able to
reproduce the slowness.
We do not consider this a fix to the problem. It is just
a tool for verifying where the slowness comes from.
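The time-limited recursion described above can be sketched as follows, in Python rather than the C of the real RemovePgTempFiles. The function name, and the choice to count entries instead of deleting temp files, are hypothetical; only the "check the clock at each step and stop descending after the limit" behaviour is taken from the description.

```python
import os
import time

def count_entries(root, start_time, limit=5.0):
    """Count directory entries under root, recursing into subdirectories.

    Before descending into each subdirectory the elapsed wall-clock time
    since start_time is checked; once more than limit seconds have
    passed, subdirectories are no longer entered.
    """
    count = 0
    with os.scandir(root) as entries:
        for entry in entries:
            count += 1
            if entry.is_dir(follow_symlinks=False):
                if time.monotonic() - start_time > limit:
                    continue  # out of time: do not recurse further
                count += count_entries(entry.path, start_time, limit)
    return count
```

Called as `count_entries(data_dir, time.monotonic())`, the walk is bounded by the limit instead of by the size of the tree, which is why it works as a diagnostic but not as a fix: anything below the skipped directories is never examined.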
We tried setting the timezone, as:
timezone = 'US/Eastern'
in postgresql.conf, but it did not help.
Mark Dilger <markdilger@yahoo.com> writes:
Prior to posting to the mailing list, we made some
changes in postmaster.c to identify where time was
being spent. Based on the elog(NOTICE,...) lines
we put in the file, we determined the time was spent
inside RemovePgTempFiles.
I then altered RemovePgTempFiles to take a starttime
parameter and, while recursing, to check if more than
5 seconds has passed since it started. I did not want
to add the complexity of setting an alarm and catching
the signal, so I just made the code check the wallclock
time at each step of the recursion. When more than
5 seconds has passed, it does not recurse further.
After making this change, we have not been able to
reproduce the slowness.
OK, so we're back to the original question: how could this possibly be
taking that long? Have you got thousands of tablespaces (and if so why)?
Does your system have a habit of crashing at times when there are
thousands of temp files? Maybe you're using IP over avian carriers to
access your SAN? It just doesn't make any sense given the information
you've provided.
regards, tom lane
We do not use tablespaces at all. We do use table
partitioning very heavily, with many check
constraints. That is the only thing unusual about
the schema.
To my eyes, the birds appear to be flying pretty
darned fast, though we have not figured out how
to remove the message bands quickly without
cutting off their feet.
The server is a virtual machine, and at this point
I will ask the sys admins to get a non-virtual
server running to reconfirm the problem.
Thanks
Mark Dilger <markdilger@yahoo.com> writes:
We do not use tablespaces at all.
[ scratches head... ] If you aren't using any tablespaces, there should
be only *one* pgsql_tmp directory, which makes this even more confusing.
(Unless you're using a pre-8.3 release, in which case there would be one
per database, so maybe if you've got hundreds/thousands of databases in
the cluster that would explain it. But I sure hope you're not still
using pre-8.3, especially not on Windows.)
regards, tom lane
We only use one database, not counting the
built-in template databases. The server is
running 9.1.3. We were running 9.1.1 until
fairly recently.
We are still getting set up to test this on
non-virtual hardware, but hope to have results
from that in a few hours or less.
Mark Dilger <markdilger@yahoo.com> writes:
We only use one database, not counting the
built-in template databases. The server is
running 9.1.3. We were running 9.1.1 until
fairly recently.
OK. I had forgotten that in recent versions, RemovePgTempFiles doesn't
only iterate through the pgsql_tmp directories; it scans the regular
database directories too, looking for possibly orphaned temp relations.
So if you had lots and lots of files in your regular database
directories, possibly scanning those could be slow. Still, it's only
looking at the file names, not attempting to stat() them or anything,
so it would be a pretty shoddy filesystem that would take a really long
time for that.
regards, tom lane
I am running this code on Windows 2003. It
appears that postgres has in src/port/dirent.c
a port of readdir() that internally uses the
WIN32_FIND_DATA structure, and the function
FindNextFile() to iterate through the directory.
Looking at the documentation, it seems that
this function does collect file creation time,
last access time, last write time, file size, etc.,
much like performing a stat.
In my case, the code is iterating through roughly
56,000 files. Apparently, this is doing the
equivalent of a stat on each of them.
See http://msdn.microsoft.com/en-us/library/windows/desktop/aa365740%28v=vs.85%29.aspx
Mark Dilger <markdilger@yahoo.com> writes:
I am running this code on Windows 2003. It
appears that postgres has in src/port/dirent.c
a port of readdir() that internally uses the
WIN32_FIND_DATA structure, and the function
FindNextFile() to iterate through the directory.
Looking at the documentation, it seems that
this function does collect file creation time,
last access time, last write time, file size, etc.,
much like performing a stat.
In my case, the code is iterating through roughly
56,000 files. Apparently, this is doing the
equivalent of a stat on each of them.
That would explain it all right. I think you're basically screwed here,
because so far as I can see Windows doesn't provide any means to
enumerate a directory's contents without fetching that info; at least
http://msdn.microsoft.com/en-us/library/windows/desktop/aa364232(v=vs.85).aspx
doesn't seem to offer any substitutes for FindFirstFile/FindNextFile.
It's barely possible that using FindFirstFileEx with fInfoLevelId =
FindExInfoBasic would save enough to be useful, except that that option
doesn't exist on Windows 2003 anyway.
Consider using another operating system ...
regards, tom lane
FindFirstFile can take a wildcard filename
pattern. It appears that we are effectively
calling FindFirstFile without a pattern, getting
all 56000 file names with complete stat
information, doing a poor-man's regex on
those names, and matching just the temporary
files.
If RemovePgTempFiles were modified to
pass a filter, this code might perform better
on Windows. I'll look into this.
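A rough sketch of the two-stage matching discussed above: a coarse wildcard first, then an exact name check. The naming rule encoded in the regular expression (a "t", digits, an underscore, digits, then an optional fork suffix and an optional segment number) is an assumption about how temporary relation files are named, and both function names are hypothetical. Python's fnmatch only imitates the wildcard here; it does not call FindFirstFile.

```python
import fnmatch
import re

# Assumed naming rule for temporary relation files:
#   t<digits>_<digits>[_<forkname>][.<segment>]
TEMP_REL_RE = re.compile(r"t\d+_\d+(_[a-z]+)?(\.\d+)?")

def looks_like_temp_rel_name(name):
    """Exact check: does the whole name follow the assumed pattern?"""
    return TEMP_REL_RE.fullmatch(name) is not None

def select_temp_rel_names(names, wildcard="t*"):
    """Apply the wildcard prefilter, then the exact check to the survivors."""
    candidates = [n for n in names if fnmatch.fnmatchcase(n, wildcard)]
    return [n for n in candidates if looks_like_temp_rel_name(n)]
```

The wildcard can only narrow the candidate set; names such as "template" still pass "t*" and have to be rejected by the exact check. Whether handing such a pattern to the operating system actually reduces the per-file cost on Windows is the open question in this thread.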
On Thu, May 24, 2012 at 12:47 AM, Mark Dilger <markdilger@yahoo.com> wrote:
I am running this code on Windows 2003. It
appears that postgres has in src/port/dirent.c
a port of readdir() that internally uses the
WIN32_FIND_DATA structure, and the function
FindNextFile() to iterate through the directory.
Looking at the documentation, it seems that
this function does collect file creation time,
last access time, last write time, file size, etc.,
much like performing a stat.
In my case, the code is iterating through roughly
56,000 files. Apparently, this is doing the
equivalent of a stat on each of them.
How did you end up with 56,000 files? Lots and lots and lots of tables?
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On Thu, May 24, 2012 at 2:42 AM, Mark Dilger <markdilger@yahoo.com> wrote:
FindFirstFile can take a wildcard filename
pattern. It appears that we are effectively
calling FindFirstFile without a pattern, getting
all 56000 file names with complete stat
information, doing a poor-man's regex on
those names, and matching just the temporary
files.
If RemovePgTempFiles were modified to
pass a filter, this code might perform better
on Windows. I'll look into this.
In that case it might also be worthwhile to look at using scandir() on
platforms that support it, so that other platforms can benefit from
the optimization too. Though I'm not sure how much that would actually
help - ISTM that scandir() scans the whole directory anyway; you just
don't have to do it yourself...
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
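The point about scandir() can be illustrated with a filter-callback scan. This is a Python sketch with a hypothetical helper, loosely modelled on the filter argument of the C scandir() function; it is not that function's API. It returns the matching names together with the number of entries that had to be visited to find them.

```python
import os

def scan_dir(path, keep):
    """Return (sorted matching names, number of directory entries visited).

    keep is a predicate on the entry name, playing the role of the
    filter callback. Every entry is still read from the directory; the
    filter only decides which names are kept.
    """
    visited = 0
    matches = []
    with os.scandir(path) as entries:
        for entry in entries:
            visited += 1
            if keep(entry.name):
                matches.append(entry.name)
    return sorted(matches), visited
```

For a directory with five entries of which two match, this returns the two names and a visited count of five, so a filter of this kind saves the caller some code but not the directory scan itself.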