POSIX shared memory redux

Started by A.M. about 15 years ago, 26 messages

A.M.

agentm@themactionfaction.com

about 15 years ago

The goal of this work is to address all of the shortcomings of previous POSIX shared memory patches as pointed out mostly by Tom Lane.

Branch: http://git.postgresql.org/gitweb?p=users/agentm/postgresql.git;a=shortlog;h=refs/heads/posix_shmem
Main file: http://git.postgresql.org/gitweb?p=users/agentm/postgresql.git;a=blob;f=src/backend/port/posix_shmem.c;h=da93848d14eeadb182d8bf1fe576d741ae5792c3;hb=refs/heads/posix_shmem

Design goals:
1) ensure that shared memory creation collisions are impossible
2) ensure that shared memory access collisions are impossible
3) ensure proper shared memory cleanup after backend and postmaster close
4) minimize API changes
http://archives.postgresql.org/pgsql-patches/2007-02/msg00527.php
http://archives.postgresql.org/pgsql-patches/2007-02/msg00558.php

This patch addresses the above goals and offers some benefits over SysV shared memory:

1) no kern.sysv management (one documentation page with platform-specific help can disappear)
2) shared memory allocation limited only by mmap usage
3) shared memory regions are completely cleaned up when the postmaster and all of its children are exited or killed for any reason (including SIGKILL)
4) shared memory creation race conditions or collisions between postmasters or backends are impossible
5) after postmaster startup, the postmaster becomes the sole arbiter of which other processes are granted access to the shared memory region
6) mmap and munmap can be used on the shared memory region; this may be useful for offering the option to expand the memory region dynamically

The design goals are accomplished by a simple change in shared memory creation: after shm_open, the region name is immediately shm_unlink'd. Because POSIX shared memory relies on file descriptors, the shared memory is not deallocated in the kernel until the last referencing file descriptor is closed (in this case, on process exit). The postmaster then becomes the sole arbiter of passing the shared memory file descriptor (either through children or through file descriptor passing, if necessary).
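The create-then-unlink idea described above can be illustrated with a minimal sketch. This is not code from the patch: the "/sketch." name prefix and the function names create_unlinked_shm and demo_inherited_fd are made up for illustration, and error handling is reduced to return codes.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Create a POSIX shared memory object of `size` bytes and unlink its name
 * immediately.  Returns the file descriptor, or -1 on failure.
 * (Hypothetical helper, not part of the patch.)
 */
int
create_unlinked_shm(size_t size)
{
	char		name[32];
	int			fd;

	assert(size > 0);

	/* pid-based name, as in the patch; the prefix is an assumption */
	snprintf(name, sizeof(name), "/sketch.%ld", (long) getpid());

	fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
	if (fd < 0)
		return -1;

	/* After this, only holders of fd can reach the memory. */
	if (shm_unlink(name) < 0 || ftruncate(fd, (off_t) size) < 0)
	{
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Fork a child that inherits fd, maps the region and writes a message;
 * the parent then reads it through its own mapping.  Returns 0 on success.
 */
int
demo_inherited_fd(void)
{
	const size_t size = 4096;
	int			fd = create_unlinked_shm(size);
	char	   *parent_view;
	pid_t		pid;
	int			status = 0;
	int			ok;

	if (fd < 0)
		return -1;
	parent_view = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (parent_view == MAP_FAILED)
	{
		close(fd);
		return -1;
	}

	pid = fork();
	if (pid == 0)
	{
		/* child: the name is gone, but the inherited fd still works */
		char	   *child_view = mmap(NULL, size, PROT_READ | PROT_WRITE,
									  MAP_SHARED, fd, 0);

		if (child_view == MAP_FAILED)
			_exit(1);
		strcpy(child_view, "hello from child");
		_exit(0);
	}

	ok = pid > 0 &&
		waitpid(pid, &status, 0) == pid &&
		WIFEXITED(status) && WEXITSTATUS(status) == 0 &&
		strcmp(parent_view, "hello from child") == 0;

	munmap(parent_view, size);
	close(fd);					/* last reference gone: kernel frees the region */
	return ok ? 0 : -1;
}
```

Calling demo_inherited_fd() from a test program should return 0 on a system with POSIX shared memory; older glibc versions may need linking with -lrt.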

The patch is a reworked version of Chris Marcellino <cmarcellino@apple.com>'s patch.

Details:

1) the shared memory name is based on getpid(); this ensures that no two starting postmasters (or other processes) will attempt to acquire the same shared memory segment.
2) the shared memory segment is created and immediately unlinked, preventing outside access to the shared memory region
3) the shared memory file descriptor is passed to backends via static int file descriptor (normal file descriptor inheritance)
* perhaps there is a better location to store the file descriptor- advice welcomed.
4) shared memory segment detach occurs when the process exits (kernel-based cleanup instead of scheduled in-process clean up)

Additional notes:
The "feature" whereby arbitrary postgres user processes could connect to the shared memory segment has been removed with this patch. If this is a desirable feature (perhaps for debugging or performance tools), this could be added by implementing a file descriptor passing server in the postmaster which would use SCM_RIGHTS control message passing to a) verify that the remote process is running as the same user as the postmaster b) pass the shared memory file descriptor to the process. I am happy to implement this, if required.
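As a rough illustration of the SCM_RIGHTS mechanism mentioned above, the following sketch passes a file descriptor across a Unix-domain socket. It is not part of the patch: send_fd, recv_fd and demo_fd_passing are hypothetical names, both ends live in one process for brevity, and the same-user check (step a) is deliberately left out because the credential API differs between platforms.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer big enough for one int, aligned for struct cmsghdr. */
typedef union
{
	struct cmsghdr align;
	char		buf[CMSG_SPACE(sizeof(int))];
} FdControl;

/* Send fd_to_send over a connected Unix-domain socket.  Returns 0 or -1. */
int
send_fd(int sock, int fd_to_send)
{
	char		payload = 'F';	/* at least one data byte must accompany it */
	struct iovec iov = {.iov_base = &payload, .iov_len = 1};
	FdControl	ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(sock >= 0 && fd_to_send >= 0);
	memset(&ctrl, 0, sizeof(ctrl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent by send_fd.  Returns the new fd or -1. */
int
recv_fd(int sock)
{
	char		payload;
	struct iovec iov = {.iov_base = &payload, .iov_len = 1};
	FdControl	ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int			fd;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
		cmsg->cmsg_type != SCM_RIGHTS ||
		cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* Pass the read end of a pipe over a socketpair and read through the copy. */
int
demo_fd_passing(void)
{
	int			sv[2], pp[2], received, ok;
	char		c = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	if (pipe(pp) < 0)
		return -1;

	received = (send_fd(sv[0], pp[0]) == 0) ? recv_fd(sv[1]) : -1;
	ok = received >= 0 &&
		write(pp[1], "x", 1) == 1 &&
		read(received, &c, 1) == 1 && c == 'x';

	if (received >= 0)
		close(received);
	close(pp[0]); close(pp[1]); close(sv[0]); close(sv[1]);
	return ok ? 0 : -1;
}
```

In a real descriptor-passing server the postmaster would hold one end of a listening socket and the requesting process the other; the sketch only shows the message layout.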

I am happy to continue work on this patch if the pg-hackers deem it worthwhile. Thanks!

Cheers,
M

Tom Lane

tgl@sss.pgh.pa.us

about 15 years ago

In reply to: A.M. (#1)

Re: POSIX shared memory redux

"A.M." <agentm@themactionfaction.com> writes:

The goal of this work is to address all of the shortcomings of previous POSIX shared memory patches as pointed out mostly by Tom Lane.

It seems like you've failed to understand the main shortcoming of this
whole idea, which is the loss of ability to detect pre-existing backends
still running in a cluster whose postmaster has crashed. The nattch
variable of SysV shmem segments is really pretty critical to us, and
AFAIK there simply is no substitute for it in POSIX-land.
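
For readers unfamiliar with the attach count referred to above, here is a small standalone sketch of how nattch is read from a SysV segment with shmctl(IPC_STAT). It is an illustration only, not how the PostgreSQL sources are organized; attach_count and demo_nattch are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>

/* Return the number of processes attached to segment shmid, or -1. */
long
attach_count(int shmid)
{
	struct shmid_ds ds;

	assert(shmid >= 0);
	if (shmctl(shmid, IPC_STAT, &ds) < 0)
		return -1;
	return (long) ds.shm_nattch;
}

/*
 * Create a private segment and watch the count change as we attach and
 * detach.  Returns 0 if the counts are 0, 1, 0 as expected.
 */
int
demo_nattch(void)
{
	int			shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
	void	   *addr;
	long		before, during, after;

	if (shmid < 0)
		return -1;

	before = attach_count(shmid);
	addr = shmat(shmid, NULL, 0);
	if (addr == (void *) -1)
	{
		shmctl(shmid, IPC_RMID, NULL);
		return -1;
	}
	during = attach_count(shmid);
	shmdt(addr);
	after = attach_count(shmid);

	/* always remove the segment so the sketch leaves nothing behind */
	shmctl(shmid, IPC_RMID, NULL);

	return (before == 0 && during == 1 && after == 0) ? 0 : -1;
}
```

A nonzero count on a segment left behind by a crashed postmaster is what signals that orphaned backends are still attached.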

regards, tom lane

Martijn van Oosterhout

kleptog@svana.org

about 15 years ago

In reply to: Tom Lane (#2)

Re: POSIX shared memory redux

On Sat, Nov 13, 2010 at 08:07:52PM -0500, Tom Lane wrote:

"A.M." <agentm@themactionfaction.com> writes:

The goal of this work is to address all of the shortcomings of previous POSIX shared memory patches as pointed out mostly by Tom Lane.

It seems like you've failed to understand the main shortcoming of this
whole idea, which is the loss of ability to detect pre-existing backends
still running in a cluster whose postmaster has crashed. The nattch
variable of SysV shmem segments is really pretty critical to us, and
AFAIK there simply is no substitute for it in POSIX-land.

I've been looking and there really doesn't appear to be. This is
consistent, as there is nothing else in POSIX where you can determine
how many other people have the same file, pipe, tty, etc. open.

I asked a few people for ideas and got answers like: just walk through
/proc and check. Apart from the portability issues, this won't work if
there are different user-IDs in play.

The only real solution seems to me to be to keep a small SysV shared
memory segment for the locking and allocate the rest of the shared
memory some other way. If all backends map the SysV memory before the
other way, then you can use the non-existence of the SysV SHM to
determine the non-existence of the other segment.

Quite a bit more work, ISTM.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Patriotism is when love of your own people comes first; nationalism,
when hate for people other than your own comes first.
- Charles de Gaulle

Tom Lane

tgl@sss.pgh.pa.us

about 15 years ago

In reply to: Martijn van Oosterhout (#3)

Re: POSIX shared memory redux

Martijn van Oosterhout <kleptog@svana.org> writes:

The only real solution seems to me to be to keep a small SysV shared
memory segment for the locking and allocate the rest of the shared
memory some other way.

Yeah, that's been discussed. It throws all the portability gains out
the window. It might get you out from under the need to readjust a
machine's SHMMAX setting before you can use a large amount of shared
memory, but it's not clear that's enough of a win to be worth the
trouble.

The other direction that we could possibly go is to find some other way
entirely of interlocking access to the data directory. If for example
we could rely on a file lock held by the postmaster and all backends,
we could check that instead of having to rely on a shmem behavior.
The killer objection to that so far is that file locking is unreliable
in some environments, particularly NFS. But it'd have some advantages
too --- in particular, in the NFS context, the fact that the lock is
visible to would-be postmasters on different machines might be thought
a huge safety improvement over what we do now.
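
One possible shape of the file-lock interlock discussed above, as a hedged sketch: every live process takes a shared fcntl() lock on a file, and a would-be postmaster asks whether an exclusive lock would conflict. The names take_shared_lock, locked_by_other and demo_lock_probe are hypothetical, the lock file is a temporary file rather than anything in a data directory, and the NFS reliability concern raised above is not addressed.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Take a non-blocking shared (read) lock on the whole file. */
int
take_shared_lock(int fd)
{
	struct flock fl;

	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_RDLCK;
	fl.l_whence = SEEK_SET;		/* l_start = 0, l_len = 0: whole file */
	return fcntl(fd, F_SETLK, &fl);
}

/*
 * Would an exclusive lock conflict with a lock held by another process?
 * Returns 1 if yes, 0 if no, -1 on error.  Locks held by the calling
 * process itself are never reported by F_GETLK.
 */
int
locked_by_other(int fd)
{
	struct flock fl;

	assert(fd >= 0);
	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_WRLCK;
	fl.l_whence = SEEK_SET;
	if (fcntl(fd, F_GETLK, &fl) < 0)
		return -1;
	return fl.l_type != F_UNLCK;
}

/* A child plays "surviving backend"; the parent plays "new postmaster". */
int
demo_lock_probe(void)
{
	char		path[] = "/tmp/lock_sketch_XXXXXX";
	int			ready[2], done[2];
	int			fd, held, released, status = 0;
	char		c;
	pid_t		pid;

	fd = mkstemp(path);
	if (fd < 0 || pipe(ready) < 0 || pipe(done) < 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0)
	{
		close(ready[0]);
		close(done[1]);
		/* fcntl locks are per process, so the child must lock for itself */
		if (take_shared_lock(fd) < 0 || write(ready[1], "r", 1) != 1)
			_exit(1);
		if (read(done[0], &c, 1) < 0)	/* wait until parent closes pipe */
			_exit(1);
		_exit(0);					/* exit releases the lock */
	}

	close(ready[1]);
	close(done[0]);
	held = (read(ready[0], &c, 1) == 1) ? locked_by_other(fd) : -1;
	close(done[1]);					/* EOF tells the child to exit */
	waitpid(pid, &status, 0);
	released = locked_by_other(fd);

	close(ready[0]);
	close(fd);
	unlink(path);
	return (held == 1 && released == 0) ? 0 : -1;
}
```

Because the kernel drops fcntl() locks when the holding process exits for any reason, the probe needs no cleanup step after a crash; that is the property the interlock would rely on.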

regards, tom lane

Robert Haas

robertmhaas@gmail.com

about 15 years ago

In reply to: Tom Lane (#4)

Re: POSIX shared memory redux

On Sun, Nov 14, 2010 at 11:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Martijn van Oosterhout <kleptog@svana.org> writes:

The only real solution seems to me to be to keep a small SysV shared
memory segment for the locking and allocate the rest of the shared
memory some other way.

Yeah, that's been discussed. It throws all the portability gains out
the window. It might get you out from under the need to readjust a
machine's SHMMAX setting before you can use a large amount of shared
memory, but it's not clear that's enough of a win to be worth the
trouble.

One of the things that would be really nice to be able to do is resize
our shm after startup, in response to changes in configuration
parameters. That's not so easy to make work, of course, but I feel
like this might be going in the right direction, since POSIX shms can
be resized using ftruncate().
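
A minimal sketch of what resizing with ftruncate() looks like, under the assumption that the platform allows growing an existing shared memory object (some platforms are known to restrict this). The name prefix "/resize." and the function demo_resize are made up for illustration; coordinating a resize across many running backends is a separate problem that this does not touch.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a 4 kB region, write to it, grow it to 8 kB and map it again.
 * Returns 0 if the old contents survive and the new size is reported.
 */
int
demo_resize(void)
{
	const size_t small_size = 4096;
	const size_t large_size = 8192;
	char		name[32];
	struct stat st;
	char	   *small_view;
	char	   *large_view;
	int			fd, ok;

	assert(large_size > small_size);

	snprintf(name, sizeof(name), "/resize.%ld", (long) getpid());
	fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
	if (fd < 0)
		return -1;
	shm_unlink(name);			/* the fd alone keeps the object alive */

	if (ftruncate(fd, (off_t) small_size) < 0)
	{
		close(fd);
		return -1;
	}
	small_view = mmap(NULL, small_size, PROT_READ | PROT_WRITE,
					  MAP_SHARED, fd, 0);
	if (small_view == MAP_FAILED)
	{
		close(fd);
		return -1;
	}
	strcpy(small_view, "kept");

	/* Grow the object; existing mappings stay valid for the old range. */
	if (ftruncate(fd, (off_t) large_size) < 0 || fstat(fd, &st) < 0)
	{
		munmap(small_view, small_size);
		close(fd);
		return -1;
	}

	/* A new, larger mapping is needed to see the added range. */
	large_view = mmap(NULL, large_size, PROT_READ | PROT_WRITE,
					  MAP_SHARED, fd, 0);
	ok = large_view != MAP_FAILED &&
		st.st_size == (off_t) large_size &&
		strcmp(large_view, "kept") == 0 &&
		large_view[large_size - 1] == 0;	/* extension is zero-filled */

	if (large_view != MAP_FAILED)
		munmap(large_view, large_size);
	munmap(small_view, small_size);
	close(fd);
	return ok ? 0 : -1;
}
```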

The other direction that we could possibly go is to find some other way
entirely of interlocking access to the data directory. If for example
we could rely on a file lock held by the postmaster and all backends,
we could check that instead of having to rely on a shmem behavior.
The killer objection to that so far is that file locking is unreliable
in some environments, particularly NFS. But it'd have some advantages
too --- in particular, in the NFS context, the fact that the lock is
visible to would-be postmasters on different machines might be thought
a huge safety improvement over what we do now.

I've never had a lot of luck making filesystem locks work reliably,
but I don't discount the possibility that I was doing it wrong.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

A.M.

agentm@themactionfaction.com

almost 15 years ago

In reply to: Robert Haas (#5)

1 attachment(s)

Re: POSIX shared memory redux

Hello,

Based on feedback from Tom Lane and Robert Haas, I have amended the POSIX shared memory patch to account for multiple-postmaster start race conditions (which is currently based on SysV shared memory checks).

https://github.com/agentm/postgres/tree/posix_shmem

Attachments:

posix_shmem.patch (application/octet-stream; name=posix_shmem.patch; x-unix-mode=0644)

*** a/configure
--- b/configure
***************
*** 850,855 **** with_gnu_ld
--- 850,856 ----
  enable_largefile
  enable_float4_byval
  enable_float8_byval
+ enable_float8_byval
  '
        ac_precious_vars='build_alias
  host_alias
***************
*** 860,865 **** LDFLAGS
--- 861,867 ----
  LIBS
  CPPFLAGS
  CPP
+ CPPFLAGS
  LDFLAGS_EX
  LDFLAGS_SL
  DOCBOOKSTYLE'
***************
*** 28245,28254 **** fi
  if test "$PORTNAME" != "win32"; then
  
  cat >>confdefs.h <<\_ACEOF
! #define USE_SYSV_SHARED_MEMORY 1
  _ACEOF
  
!   SHMEM_IMPLEMENTATION="src/backend/port/sysv_shmem.c"
  else
  
  cat >>confdefs.h <<\_ACEOF
--- 28247,28256 ----
  if test "$PORTNAME" != "win32"; then
  
  cat >>confdefs.h <<\_ACEOF
! #define USE_POSIX_SHARED_MEMORY 1
  _ACEOF
  
!   SHMEM_IMPLEMENTATION="src/backend/port/posix_shmem.c"
  else
  
  cat >>confdefs.h <<\_ACEOF
*** a/configure.in
--- b/configure.in
***************
*** 1730,1737 **** fi
  
  # Select shared-memory implementation type.
  if test "$PORTNAME" != "win32"; then
!   AC_DEFINE(USE_SYSV_SHARED_MEMORY, 1, [Define to select SysV-style shared memory.])
!   SHMEM_IMPLEMENTATION="src/backend/port/sysv_shmem.c"
  else
    AC_DEFINE(USE_WIN32_SHARED_MEMORY, 1, [Define to select Win32-style shared memory.])
    SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c"
--- 1730,1737 ----
  
  # Select shared-memory implementation type.
  if test "$PORTNAME" != "win32"; then
!   AC_DEFINE(USE_POSIX_SHARED_MEMORY, 1, [Define to select POSIX-style shared memory.])
!   SHMEM_IMPLEMENTATION="src/backend/port/posix_shmem.c" 
  else
    AC_DEFINE(USE_WIN32_SHARED_MEMORY, 1, [Define to select Win32-style shared memory.])
    SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c"
*** a/src/backend/bootstrap/bootstrap.c
--- b/src/backend/bootstrap/bootstrap.c
***************
*** 352,358 **** AuxiliaryProcessMain(int argc, char *argv[])
  
  	/* If standalone, create lockfile for data directory */
  	if (!IsUnderPostmaster)
! 		CreateDataDirLockFile(false);
  
  	SetProcessingMode(BootstrapProcessing);
  	IgnoreSystemIndexes = true;
--- 352,361 ----
  
  	/* If standalone, create lockfile for data directory */
  	if (!IsUnderPostmaster)
!           CreateDataDirLockFile(false,false);
! 	
! 	/* Hold on to the lock file for the life of this process. */
! 	AcquireDataDirLock();
  
  	SetProcessingMode(BootstrapProcessing);
  	IgnoreSystemIndexes = true;
*** /dev/null
--- b/src/backend/port/posix_shmem.c
***************
*** 0 ****
--- 1,469 ----
+ /*-------------------------------------------------------------------------
+  *
+  * posix_shmem.c
+  *	  Implement shared memory using POSIX facilities
+  *
+  * These routines represent a fairly thin layer on top of POSIX shared
+  * memory functionality.
+  *
+  * Portions Copyright (c) 1996-2006, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *-------------------------------------------------------------------------
+  */
+ #include "postgres.h"
+ 
+ #include <signal.h>
+ #include <unistd.h>
+ #include <sys/file.h>
+ #include <sys/mman.h>
+ #include <sys/param.h>
+ #include <sys/stat.h>
+ #include <sys/types.h>
+ #ifdef HAVE_KERNEL_OS_H
+ #include <kernel/OS.h>
+ #endif
+ 
+ #include "miscadmin.h"
+ #include "libpq/md5.h"
+ #include "storage/ipc.h"
+ #include "storage/pg_shmem.h"
+ 
+ 
+ #define IPCProtection	(0600)	/* access/modify by user only */
+ #define IPCNameLength		31  /* Darwin requires max 30 + '\0' */
+ 
+ uint8	UsedShmemInstanceId = 0;
+ void	*UsedShmemSegAddr = NULL;
+ 
+ static void GenerateIPCName(uint8 instanceId, char destIPCName[IPCNameLength]);
+ static void *InternalIpcMemoryCreate(const char ipcName[IPCNameLength],uint8 instanceId, Size size);
+ static void IpcMemoryDetach(int status, Datum shmaddr);
+ static int POSIXSharedMemoryFD=-1;
+ 
+ 
+ /*
+  *	GenerateIPCName(instanceId, destIPCName)
+  *
+  * Generate a shared memory object key name from the pid of the current
+  * process (see below). Note that instanceId is currently not used when
+  * constructing the name.
+  * Store the result in destIPCName, which must be IPCNameLength bytes.
+  */
+ static void
+ GenerateIPCName(uint8 instanceId, char destIPCName[IPCNameLength])
+ {
+   
+   /* This must be 30 characters or less for portability (i.e. Darwin).
+    * POSIX requires shared memory names to begin with a single slash. It 
+    * should not have any other slashes or any non-alphanumerics, as
+    * that is the broadest assumption of what is permitted in a filename.
+    * Also, case sensitivity should not be presumed.
+    *
+    * Collisions are averted by the fact that the shared memory region is 
+    * immediately unlinked.
+    * 
+    * The string is formed starting with a slash, then the identifier 'PG.',
+    * then the pid of the current process.
+    */
+   snprintf(destIPCName, IPCNameLength, "/PG.%6ld", (long int)getpid());
+ }
+ 
+ /*
+  *	InternalIpcMemoryCreate(ipcName, size)
+  *
+  * Attempt to create a new shared memory segment with the specified IPC name.
+  * Will fail (return NULL) if such a segment already exists.  If successful,
+  * attach the segment to the current process and return its attached address.
+  * On success, callbacks are registered with on_shmem_exit to detach and
+  * delete the segment when on_shmem_exit is called.
+  *
+  * If we fail with a failure code other than collision-with-existing-segment,
+  * print out an error and abort.  Other types of errors are not recoverable.
+  */
+ static void *
+ InternalIpcMemoryCreate(const char ipcName[IPCNameLength], uint8 instanceId, Size size)
+ {
+ 	int			fd;
+ 	int unlink_status=0;
+ 	int fstat_status=0;
+ 	int ftruncate_status=0;
+ 	void	   *shmaddr;
+ 	struct		stat statbuf;
+ 	
+ 	fd = shm_open(ipcName, O_RDWR | O_CREAT | O_EXCL, IPCProtection);
+ 
+ 	if (fd < 0)
+ 	{
+ 		/*
+ 		 * Fail quietly if error indicates a collision with existing segment.
+ 		 * One would expect EEXIST, given that we said O_EXCL.
+ 		 */
+ 		if (errno == EEXIST || errno == EACCES || errno == EINTR)
+ 			return NULL;
+ 
+ 		/*
+ 		 * Else complain and abort
+ 		 */
+ 		ereport(FATAL,
+ 				(errmsg("could not create shared memory segment: %m"),
+ 				 errdetail("Failed system call was shm_open(name=%s, oflag=%lu, mode=%lu).",
+ 						   ipcName, (unsigned long) (O_RDWR | O_CREAT | O_EXCL),
+ 						   (unsigned long) IPCProtection),
+ 				 (errno == EMFILE) ?
+ 				 errhint("This error means that the process has reached its limit "
+ 						 "for open file descriptors.") : 0,
+ 				 (errno == ENOSPC) ?
+ 				 errhint("This error means the process has run out of address "
+ 						 "space.") : 0));
+ 	}
+ 	/* the race between creation and unlinking is protected by the shared memory pid file */
+ 
+ 	
+ 	unlink_status = shm_unlink(ipcName);
+ 	if(unlink_status<0)
+ 	  {
+ 	    /* It would be virtually impossible for us to fail to unlink a shared memory region we just created, but we need to handle this anyway- refuse to use this shared memory segment. */
+ 	    ereport(FATAL,
+ 		    (errmsg("could not unlink shared memory segment : %m"),
+ 		     errdetail("Failed system call was shm_unlink(name=%s).",ipcName)));
+ 	    return NULL;
+ 	  }
+ 	
+ 	/* Increase the size of the file descriptor to the desired length.
+ 	 * If this fails so will mmap since it can't map size bytes. */
+ 	fstat_status = fstat(fd, &statbuf);
+ 	if(fstat_status<0)
+ 	  {
+ 	    ereport(FATAL,
+ 		    (errmsg("could not fstat the shared memory segment : %m"),
+ 		     errdetail("Failed system call was fstat(fd=%d,stat=%p).",fd,&statbuf)));
+ 	    return NULL;
+ 	  }
+ 	if (statbuf.st_size < size)
+ 	  {
+ 	    ftruncate_status = ftruncate(fd, size);
+ 	    if(ftruncate_status<0)
+ 	      {
+ 		ereport(FATAL,
+ 			(errmsg("could not set the proper shared memory segment size : %m"),
+ 			 errdetail("Failed system call was ftruncate(fd=%d,size=%lu).",fd,(unsigned long) size)));
+ 		  return NULL;
+ 	      }
+ 	  }
+ 	
+ 	/* OK, should be able to attach to the segment */
+ 	shmaddr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ 
+ 	if (shmaddr == MAP_FAILED)
+ 	  elog(FATAL, "mmap with size=%lu and fd=%d failed: %m", (unsigned long) size, fd);
+ 
+ 	/* Register on-exit routine to detach new segment before deleting */
+ 	on_shmem_exit(IpcMemoryDetach, PointerGetDatum(shmaddr));
+ 
+ 	POSIXSharedMemoryFD = fd;
+ 	return shmaddr;
+ }
+ 
+ /****************************************************************************/
+ /*	IpcMemoryDetach(status, shmaddr)	removes a shared memory segment		*/
+ /*										from process' address space		*/
+ /*	(called as an on_shmem_exit callback, hence funny argument list)		*/
+ /****************************************************************************/
+ static void
+ IpcMemoryDetach(int status, Datum shmaddr)
+ {
+ 	PGShmemHeader  *hdr;
+ 	hdr = (PGShmemHeader *) DatumGetPointer(shmaddr);
+ 	
+ 	if (munmap(DatumGetPointer(shmaddr), hdr->totalsize) < 0)
+ 		elog(LOG, "munmap(%p, ...) failed: %m", DatumGetPointer(shmaddr));
+ }
+ 
+ /*
+  * PGSharedMemoryIsInUse
+  *
+  * Is a previously-existing shmem segment still existing and in use?
+  *
+  * The point of this exercise is to detect the case where a prior postmaster
+  * crashed, but it left child backends that are still running.	Therefore
+  * we only care about shmem segments that are associated with the intended
+  * DataDir.  This is an important consideration since accidental matches of
+  * shmem segment IDs are reasonably common.
+  */
+ bool
+ PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2)
+ {
+ 	char		ipcName[IPCNameLength];
+ 	PGShmemHeader  *hdr;
+ 	int			fd, isValidHeader;
+ 	
+ #ifndef WIN32
+ 	struct stat statbuf;
+ #endif
+ 	
+ 	/*
+ 	 * We detect whether a shared memory segment is in use by seeing whether
+ 	 * we can open it. If so, 
+ 	 */
+ 	GenerateIPCName((uint8) id1, ipcName);
+ 	fd = shm_open(ipcName, O_RDWR, 0);
+ 	if (fd < 0)
+ 	{
+ 		/*
+ 		 * ENOENT means the segment no longer exists.
+ 		 */
+ 		if (errno == ENOENT)
+ 			return false;
+ 
+ 		/*
+ 		 * EACCES implies that the segment belongs to some other userid, which
+ 		 * means that there is a different account with the same database open.
+ 		 */
+ 		if (errno == EACCES)
+ 			return true;
+ 	}
+ 
+ 	/*
+ 	 * Try to attach to the segment and see if it matches our data directory,
+ 	 * just as a sanity check. Note that this is not absolutely necessary
+ 	 * since the data directory is encoded in the IPC shared memory key name.
+ 	 * 
+ 	 * On Windows, which doesn't have useful inode numbers, we can't do this
+ 	 * so we punt and assume that the shared memory is valid (which in all
+ 	 * likelihood it is).
+ 	 */
+ #ifdef WIN32
+ 	close(fd);
+ 	return true;
+ #else
+ 	if (stat(DataDir, &statbuf) < 0)
+ 	{
+ 		close(fd);
+ 		return true;			/* if can't stat, be conservative */
+ 	}
+ 
+ 	hdr = (PGShmemHeader *) mmap(NULL, sizeof(PGShmemHeader), PROT_READ, MAP_SHARED, fd, 0);
+ 	close(fd);
+ 
+ 	if (hdr == (PGShmemHeader *) -1)
+ 		return true;			/* if can't attach, be conservative */
+ 
+ 	isValidHeader = hdr->magic == PGShmemMagic &&
+ 		hdr->device == statbuf.st_dev &&
+ 		hdr->inode == statbuf.st_ino;
+ 	munmap((void *) hdr, sizeof(PGShmemHeader));
+ 	
+ 	/*
+ 	 * If false, it's either not a Postgres segment, or not one for my data
+ 	 * directory.  In either case it poses no threat.
+ 	 * If true, trouble -- looks a lot like there are still live backends
+ 	 */
+ 	
+ 	return isValidHeader;
+ #endif
+ }
+ 
+ 
+ /*
+  * PGSharedMemoryCreate
+  *
+  * Create a shared memory segment of the given size and initialize its
+  * standard header.  Also, register an on_shmem_exit callback to release
+  * the storage.
+  *
+  * Dead Postgres segments are released when found, but we do not fail upon
+  * collision with non-Postgres shmem segments, although this is astronomically
+  * unlikely.
+  *
+  * makePrivate means to always create a new segment, rather than attach to
+  * or recycle any existing segment. Currently, this value is ignored as
+  * all segments are newly created (the dead ones are simply released).
+  *
+  * Port is ignored. (It is leftover from the SysV shared memory routines.)
+  */
+ PGShmemHeader *
+ PGSharedMemoryCreate(Size size, bool makePrivate, int port)
+ {
+ 	uint8			instanceId;
+ 	void		   *shmaddr;
+ 	PGShmemHeader  *hdr;
+ 	char			ipcName[IPCNameLength];
+ 	
+ #ifndef WIN32
+ 	struct stat		statbuf;
+ #endif
+ 
+ 	/* Room for a header? */
+ 	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
+ 
+ 	/* Make sure PGSharedMemoryAttach doesn't fail without need */
+ 	UsedShmemSegAddr = NULL;
+ 
+ 	/* Loop till we find a free IPC key */
+ 	for (instanceId = 0; true; instanceId++)
+ 	{
+ 		/*
+ 		 * Try to create new segment. InternalIpcMemoryCreate encodes the data
+ 		 * directory path name into the IPC key name, so if this fails
+ 		 * one of three things has happened:
+ 		 * 1) there is another postmaster still running with the same data directory
+ 		 * 2) the postmaster in this directory crashed or was kill -9'd
+ 		 *		and there are backends still running.
+ 		 * 3) the postmaster in this directory crashed or was kill -9'd and there 
+ 		 *		are no backends still running, just an orphaned shmem segment
+ 		 *
+ 		 * Case 1 is handled by the postmaster.pid file and doesn't concern us here.
+ 		 * For case 2 & 3 we now should unlink the shmem segment so that it is
+ 		 * cleaned up, either now (case 3) or when the backends terminate (case 2). 
+ 		 * Then we should try the next instanceId to create a new segment so this
+ 		 * process can be up and running quickly.  
+ 		 */
+ 		GenerateIPCName(instanceId, ipcName);
+ 		shmaddr = InternalIpcMemoryCreate(ipcName, instanceId, size);
+ 		if (shmaddr)
+ 			break;				/* successful create and attach */
+ 
+ 		/*
+ 		 * The segment appears to be from a dead Postgres process, or from a
+ 		 * previous cycle of life in this same process.  Zap it, if possible.
+ 		 * This shouldn't fail, but if it does, assume the segment
+ 		 * belongs to someone else after all, and continue quietly.
+ 		 */
+ 		shm_unlink(ipcName);
+ 	}
+ 
+ 	/* OK, we created a new segment.  Mark it as created by this process. */
+ 	hdr = (PGShmemHeader *) shmaddr;
+ 	hdr->creatorPID = getpid();
+ 	hdr->magic = PGShmemMagic;
+ 
+ #ifndef WIN32
+ 	/* Fill in the data directory ID info, too */
+ 	if (stat(DataDir, &statbuf) < 0)
+ 		ereport(FATAL,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not stat data directory \"%s\": %m",
+ 						DataDir)));
+ 	hdr->device = statbuf.st_dev;
+ 	hdr->inode = statbuf.st_ino;
+ #endif
+ 
+ 	/* Initialize space allocation status for segment. */
+ 	hdr->totalsize = size;
+ 	hdr->freeoffset = MAXALIGN(sizeof(PGShmemHeader));
+ 
+ 	/* Save info for possible future use */
+ 	UsedShmemInstanceId = instanceId;
+ 	UsedShmemSegAddr = shmaddr;
+ 
+ 	return hdr;
+ }
+ 
+ #ifdef EXEC_BACKEND
+ 
+ /*
+  * PGSharedMemoryReAttach
+  *
+  * Re-attach to an already existing shared memory segment.	In the non
+  * EXEC_BACKEND case this is not used, because postmaster children inherit
+  * the shared memory segment attachment via fork().
+  *
+  * UsedShmemInstanceId and UsedShmemSegAddr are implicit parameters to this
+  * routine.  The caller must have already restored them to the postmaster's
+  * values.
+  */
+ void
+ PGSharedMemoryReAttach(void)
+ {
+ 	int		fd;
+ 	void   *hdr;
+ 	void   *origUsedShmemSegAddr = UsedShmemSegAddr;
+ 
+ 	Assert(UsedShmemSegAddr != NULL);
+ 	Assert(IsUnderPostmaster);
+ 
+ #ifdef __CYGWIN__
+ 	/* cygipc (currently) appears to not detach on exec. */
+ 	PGSharedMemoryDetach();
+ 	UsedShmemSegAddr = origUsedShmemSegAddr;
+ #endif
+ 
+ 	elog(DEBUG3, "attaching to %p", UsedShmemSegAddr);
+ 	hdr = (void *) PGSharedMemoryAttach(UsedShmemInstanceId);
+ 	if (hdr == NULL)
+ 		elog(FATAL, "could not reattach to shared memory (instanceId=%d, addr=%p): %m",
+ 			 (int) UsedShmemInstanceId, UsedShmemSegAddr);
+ 	if (hdr != origUsedShmemSegAddr)
+ 		elog(FATAL, "reattaching to shared memory returned unexpected address (got %p, expected %p)",
+ 			 hdr, origUsedShmemSegAddr);
+ 
+ 	UsedShmemSegAddr = hdr;		/* probably redundant */
+ }
+ 
+ 
+ /*
+  * Attach to shared memory and make sure it has a Postgres header
+  *
+  * Returns attach address if OK, else NULL
+  */
+ static PGShmemHeader *
+ PGSharedMemoryAttach(uint8 instanceId)
+ {
+ 	PGShmemHeader *hdr;
+ 	char		ipcName[IPCNameLength];
+ 	Size		size;
+ 	int fd;
+ 
+ 	fd = POSIXSharedMemoryFD;
+ 
+ 	if (fd < 0)
+ 		return NULL;
+ 
+ 	hdr = (PGShmemHeader *) mmap(UsedShmemSegAddr, sizeof(PGShmemHeader),
+ 								 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ 
+ 	if (hdr == MAP_FAILED)
+ 	{
+ 		return NULL;			/* failed to mmap- unlikely */
+ 	}
+ 
+ 	if (hdr->magic != PGShmemMagic)
+ 	{
+ 		munmap((void *) hdr, sizeof(PGShmemHeader));
+ 		return NULL;			/* segment belongs to a non-Postgres app */
+ 	}
+ 	
+ 	/* Since the segment has a valid Postgres header, unmap and re-map it with the proper size */
+ 	size = hdr->totalsize;
+ 	munmap((void *) hdr, sizeof(PGShmemHeader));
+ 	hdr = (PGShmemHeader *) mmap(UsedShmemSegAddr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ 	
+ 	if (hdr == MAP_FAILED)   /* this shouldn't happen */
+ 		return NULL;
+ 	
+ 	return hdr;
+ }
+ #endif   /* EXEC_BACKEND */
+ 
+ /*
+  * PGSharedMemoryDetach
+  *
+  * Detach from the shared memory segment, if still attached.  This is not
+  * intended for use by the process that originally created the segment
+  * (it will have an on_shmem_exit callback registered to do that).	Rather,
+  * this is for subprocesses that have inherited an attachment and want to
+  * get rid of it.
+  */
+ void
+ PGSharedMemoryDetach(void)
+ {
+ 	PGShmemHeader  *hdr;
+ 	if (UsedShmemSegAddr != NULL)
+ 	{
+ 		hdr = (PGShmemHeader *) UsedShmemSegAddr;
+ 		if (munmap(UsedShmemSegAddr, hdr->totalsize) < 0)
+ 			elog(LOG, "munmap(%p) failed: %m", UsedShmemSegAddr);
+ 		UsedShmemSegAddr = NULL;
+ 	}
+ }
*** a/src/backend/port/sysv_shmem.c
--- /dev/null
***************
*** 1,550 ****
- /*-------------------------------------------------------------------------
-  *
-  * sysv_shmem.c
-  *	  Implement shared memory using SysV facilities
-  *
-  * These routines represent a fairly thin layer on top of SysV shared
-  * memory functionality.
-  *
-  * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
-  * Portions Copyright (c) 1994, Regents of the University of California
-  *
-  * IDENTIFICATION
-  *	  src/backend/port/sysv_shmem.c
-  *
-  *-------------------------------------------------------------------------
-  */
- #include "postgres.h"
- 
- #include <signal.h>
- #include <unistd.h>
- #include <sys/file.h>
- #include <sys/stat.h>
- #ifdef HAVE_SYS_IPC_H
- #include <sys/ipc.h>
- #endif
- #ifdef HAVE_SYS_SHM_H
- #include <sys/shm.h>
- #endif
- #ifdef HAVE_KERNEL_OS_H
- #include <kernel/OS.h>
- #endif
- 
- #include "miscadmin.h"
- #include "storage/ipc.h"
- #include "storage/pg_shmem.h"
- 
- 
- typedef key_t IpcMemoryKey;		/* shared memory key passed to shmget(2) */
- typedef int IpcMemoryId;		/* shared memory ID returned by shmget(2) */
- 
- #define IPCProtection	(0600)	/* access/modify by user only */
- 
- #ifdef SHM_SHARE_MMU			/* use intimate shared memory on Solaris */
- #define PG_SHMAT_FLAGS			SHM_SHARE_MMU
- #else
- #define PG_SHMAT_FLAGS			0
- #endif
- 
- 
- unsigned long UsedShmemSegID = 0;
- void	   *UsedShmemSegAddr = NULL;
- 
- static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
- static void IpcMemoryDetach(int status, Datum shmaddr);
- static void IpcMemoryDelete(int status, Datum shmId);
- static PGShmemHeader *PGSharedMemoryAttach(IpcMemoryKey key,
- 					 IpcMemoryId *shmid);
- 
- 
- /*
-  *	InternalIpcMemoryCreate(memKey, size)
-  *
-  * Attempt to create a new shared memory segment with the specified key.
-  * Will fail (return NULL) if such a segment already exists.  If successful,
-  * attach the segment to the current process and return its attached address.
-  * On success, callbacks are registered with on_shmem_exit to detach and
-  * delete the segment when on_shmem_exit is called.
-  *
-  * If we fail with a failure code other than collision-with-existing-segment,
-  * print out an error and abort.  Other types of errors are not recoverable.
-  */
- static void *
- InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size)
- {
- 	IpcMemoryId shmid;
- 	void	   *memAddress;
- 
- 	shmid = shmget(memKey, size, IPC_CREAT | IPC_EXCL | IPCProtection);
- 
- 	if (shmid < 0)
- 	{
- 		/*
- 		 * Fail quietly if error indicates a collision with existing segment.
- 		 * One would expect EEXIST, given that we said IPC_EXCL, but perhaps
- 		 * we could get a permission violation instead?  Also, EIDRM might
- 		 * occur if an old seg is slated for destruction but not gone yet.
- 		 */
- 		if (errno == EEXIST || errno == EACCES
- #ifdef EIDRM
- 			|| errno == EIDRM
- #endif
- 			)
- 			return NULL;
- 
- 		/*
- 		 * Some BSD-derived kernels are known to return EINVAL, not EEXIST, if
- 		 * there is an existing segment but it's smaller than "size" (this is
- 		 * a result of poorly-thought-out ordering of error tests). To
- 		 * distinguish between collision and invalid size in such cases, we
- 		 * make a second try with size = 0.  These kernels do not test size
- 		 * against SHMMIN in the preexisting-segment case, so we will not get
- 		 * EINVAL a second time if there is such a segment.
- 		 */
- 		if (errno == EINVAL)
- 		{
- 			int			save_errno = errno;
- 
- 			shmid = shmget(memKey, 0, IPC_CREAT | IPC_EXCL | IPCProtection);
- 
- 			if (shmid < 0)
- 			{
- 				/* As above, fail quietly if we verify a collision */
- 				if (errno == EEXIST || errno == EACCES
- #ifdef EIDRM
- 					|| errno == EIDRM
- #endif
- 					)
- 					return NULL;
- 				/* Otherwise, fall through to report the original error */
- 			}
- 			else
- 			{
- 				/*
- 				 * On most platforms we cannot get here because SHMMIN is
- 				 * greater than zero.  However, if we do succeed in creating a
- 				 * zero-size segment, free it and then fall through to report
- 				 * the original error.
- 				 */
- 				if (shmctl(shmid, IPC_RMID, NULL) < 0)
- 					elog(LOG, "shmctl(%d, %d, 0) failed: %m",
- 						 (int) shmid, IPC_RMID);
- 			}
- 
- 			errno = save_errno;
- 		}
- 
- 		/*
- 		 * Else complain and abort.
- 		 *
- 		 * Note: at this point EINVAL should mean that either SHMMIN or SHMMAX
- 		 * is violated.  SHMALL violation might be reported as either ENOMEM
- 		 * (BSDen) or ENOSPC (Linux); the Single Unix Spec fails to say which
- 		 * it should be.  SHMMNI violation is ENOSPC, per spec.  Just plain
- 		 * not-enough-RAM is ENOMEM.
- 		 */
- 		ereport(FATAL,
- 				(errmsg("could not create shared memory segment: %m"),
- 		  errdetail("Failed system call was shmget(key=%lu, size=%lu, 0%o).",
- 					(unsigned long) memKey, (unsigned long) size,
- 					IPC_CREAT | IPC_EXCL | IPCProtection),
- 				 (errno == EINVAL) ?
- 				 errhint("This error usually means that PostgreSQL's request for a shared memory "
- 		  "segment exceeded your kernel's SHMMAX parameter.  You can either "
- 						 "reduce the request size or reconfigure the kernel with larger SHMMAX.  "
- 				  "To reduce the request size (currently %lu bytes), reduce "
- 					   "PostgreSQL's shared memory usage, perhaps by reducing shared_buffers "
- 						 "or max_connections.\n"
- 						 "If the request size is already small, it's possible that it is less than "
- 						 "your kernel's SHMMIN parameter, in which case raising the request size or "
- 						 "reconfiguring SHMMIN is called for.\n"
- 		"The PostgreSQL documentation contains more information about shared "
- 						 "memory configuration.",
- 						 (unsigned long) size) : 0,
- 				 (errno == ENOMEM) ?
- 				 errhint("This error usually means that PostgreSQL's request for a shared "
- 				   "memory segment exceeded available memory or swap space, "
- 						 "or exceeded your kernel's SHMALL parameter.  You can either "
- 						 "reduce the request size or reconfigure the kernel with larger SHMALL.  "
- 				  "To reduce the request size (currently %lu bytes), reduce "
- 					   "PostgreSQL's shared memory usage, perhaps by reducing shared_buffers "
- 						 "or max_connections.\n"
- 		"The PostgreSQL documentation contains more information about shared "
- 						 "memory configuration.",
- 						 (unsigned long) size) : 0,
- 				 (errno == ENOSPC) ?
- 				 errhint("This error does *not* mean that you have run out of disk space. "
- 						 "It occurs either if all available shared memory IDs have been taken, "
- 						 "in which case you need to raise the SHMMNI parameter in your kernel, "
- 		  "or because the system's overall limit for shared memory has been "
- 				 "reached.  If you cannot increase the shared memory limit, "
- 		  "reduce PostgreSQL's shared memory request (currently %lu bytes), "
- 				   "perhaps by reducing shared_buffers or max_connections.\n"
- 		"The PostgreSQL documentation contains more information about shared "
- 						 "memory configuration.",
- 						 (unsigned long) size) : 0));
- 	}
- 
- 	/* Register on-exit routine to delete the new segment */
- 	on_shmem_exit(IpcMemoryDelete, Int32GetDatum(shmid));
- 
- 	/* OK, should be able to attach to the segment */
- 	memAddress = shmat(shmid, NULL, PG_SHMAT_FLAGS);
- 
- 	if (memAddress == (void *) -1)
- 		elog(FATAL, "shmat(id=%d) failed: %m", shmid);
- 
- 	/* Register on-exit routine to detach new segment before deleting */
- 	on_shmem_exit(IpcMemoryDetach, PointerGetDatum(memAddress));
- 
- 	/*
- 	 * Store shmem key and ID in data directory lockfile.  Format to try to
- 	 * keep it the same length always (trailing junk in the lockfile won't
- 	 * hurt, but might confuse humans).
- 	 */
- 	{
- 		char line[64];
- 
- 		sprintf(line, "%9lu %9lu",
- 				(unsigned long) memKey, (unsigned long) shmid);
- 		AddToDataDirLockFile(LOCK_FILE_LINE_SHMEM_KEY, line);
- 	}
- 
- 	return memAddress;
- }
- 
- /****************************************************************************/
- /*	IpcMemoryDetach(status, shmaddr)	removes a shared memory segment		*/
- /*										from process' address spaceq		*/
- /*	(called as an on_shmem_exit callback, hence funny argument list)		*/
- /****************************************************************************/
- static void
- IpcMemoryDetach(int status, Datum shmaddr)
- {
- 	if (shmdt(DatumGetPointer(shmaddr)) < 0)
- 		elog(LOG, "shmdt(%p) failed: %m", DatumGetPointer(shmaddr));
- }
- 
- /****************************************************************************/
- /*	IpcMemoryDelete(status, shmId)		deletes a shared memory segment		*/
- /*	(called as an on_shmem_exit callback, hence funny argument list)		*/
- /****************************************************************************/
- static void
- IpcMemoryDelete(int status, Datum shmId)
- {
- 	if (shmctl(DatumGetInt32(shmId), IPC_RMID, NULL) < 0)
- 		elog(LOG, "shmctl(%d, %d, 0) failed: %m",
- 			 DatumGetInt32(shmId), IPC_RMID);
- }
- 
- /*
-  * PGSharedMemoryIsInUse
-  *
-  * Is a previously-existing shmem segment still existing and in use?
-  *
-  * The point of this exercise is to detect the case where a prior postmaster
-  * crashed, but it left child backends that are still running.	Therefore
-  * we only care about shmem segments that are associated with the intended
-  * DataDir.  This is an important consideration since accidental matches of
-  * shmem segment IDs are reasonably common.
-  */
- bool
- PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2)
- {
- 	IpcMemoryId shmId = (IpcMemoryId) id2;
- 	struct shmid_ds shmStat;
- 	struct stat statbuf;
- 	PGShmemHeader *hdr;
- 
- 	/*
- 	 * We detect whether a shared memory segment is in use by seeing whether
- 	 * it (a) exists and (b) has any processes attached to it.
- 	 */
- 	if (shmctl(shmId, IPC_STAT, &shmStat) < 0)
- 	{
- 		/*
- 		 * EINVAL actually has multiple possible causes documented in the
- 		 * shmctl man page, but we assume it must mean the segment no longer
- 		 * exists.
- 		 */
- 		if (errno == EINVAL)
- 			return false;
- 
- 		/*
- 		 * EACCES implies that the segment belongs to some other userid, which
- 		 * means it is not a Postgres shmem segment (or at least, not one that
- 		 * is relevant to our data directory).
- 		 */
- 		if (errno == EACCES)
- 			return false;
- 
- 		/*
- 		 * Some Linux kernel versions (in fact, all of them as of July 2007)
- 		 * sometimes return EIDRM when EINVAL is correct.  The Linux kernel
- 		 * actually does not have any internal state that would justify
- 		 * returning EIDRM, so we can get away with assuming that EIDRM is
- 		 * equivalent to EINVAL on that platform.
- 		 */
- #ifdef HAVE_LINUX_EIDRM_BUG
- 		if (errno == EIDRM)
- 			return false;
- #endif
- 
- 		/*
- 		 * Otherwise, we had better assume that the segment is in use. The
- 		 * only likely case is EIDRM, which implies that the segment has been
- 		 * IPC_RMID'd but there are still processes attached to it.
- 		 */
- 		return true;
- 	}
- 
- 	/* If it has no attached processes, it's not in use */
- 	if (shmStat.shm_nattch == 0)
- 		return false;
- 
- 	/*
- 	 * Try to attach to the segment and see if it matches our data directory.
- 	 * This avoids shmid-conflict problems on machines that are running
- 	 * several postmasters under the same userid.
- 	 */
- 	if (stat(DataDir, &statbuf) < 0)
- 		return true;			/* if can't stat, be conservative */
- 
- 	hdr = (PGShmemHeader *) shmat(shmId, NULL, PG_SHMAT_FLAGS);
- 
- 	if (hdr == (PGShmemHeader *) -1)
- 		return true;			/* if can't attach, be conservative */
- 
- 	if (hdr->magic != PGShmemMagic ||
- 		hdr->device != statbuf.st_dev ||
- 		hdr->inode != statbuf.st_ino)
- 	{
- 		/*
- 		 * It's either not a Postgres segment, or not one for my data
- 		 * directory.  In either case it poses no threat.
- 		 */
- 		shmdt((void *) hdr);
- 		return false;
- 	}
- 
- 	/* Trouble --- looks a lot like there's still live backends */
- 	shmdt((void *) hdr);
- 
- 	return true;
- }
- 
- 
- /*
-  * PGSharedMemoryCreate
-  *
-  * Create a shared memory segment of the given size and initialize its
-  * standard header.  Also, register an on_shmem_exit callback to release
-  * the storage.
-  *
-  * Dead Postgres segments are recycled if found, but we do not fail upon
-  * collision with non-Postgres shmem segments.	The idea here is to detect and
-  * re-use keys that may have been assigned by a crashed postmaster or backend.
-  *
-  * makePrivate means to always create a new segment, rather than attach to
-  * or recycle any existing segment.
-  *
-  * The port number is passed for possible use as a key (for SysV, we use
-  * it to generate the starting shmem key).	In a standalone backend,
-  * zero will be passed.
-  */
- PGShmemHeader *
- PGSharedMemoryCreate(Size size, bool makePrivate, int port)
- {
- 	IpcMemoryKey NextShmemSegID;
- 	void	   *memAddress;
- 	PGShmemHeader *hdr;
- 	IpcMemoryId shmid;
- 	struct stat statbuf;
- 
- 	/* Room for a header? */
- 	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
- 
- 	/* Make sure PGSharedMemoryAttach doesn't fail without need */
- 	UsedShmemSegAddr = NULL;
- 
- 	/* Loop till we find a free IPC key */
- 	NextShmemSegID = port * 1000;
- 
- 	for (NextShmemSegID++;; NextShmemSegID++)
- 	{
- 		/* Try to create new segment */
- 		memAddress = InternalIpcMemoryCreate(NextShmemSegID, size);
- 		if (memAddress)
- 			break;				/* successful create and attach */
- 
- 		/* Check shared memory and possibly remove and recreate */
- 
- 		if (makePrivate)		/* a standalone backend shouldn't do this */
- 			continue;
- 
- 		if ((memAddress = PGSharedMemoryAttach(NextShmemSegID, &shmid)) == NULL)
- 			continue;			/* can't attach, not one of mine */
- 
- 		/*
- 		 * If I am not the creator and it belongs to an extant process,
- 		 * continue.
- 		 */
- 		hdr = (PGShmemHeader *) memAddress;
- 		if (hdr->creatorPID != getpid())
- 		{
- 			if (kill(hdr->creatorPID, 0) == 0 || errno != ESRCH)
- 			{
- 				shmdt(memAddress);
- 				continue;		/* segment belongs to a live process */
- 			}
- 		}
- 
- 		/*
- 		 * The segment appears to be from a dead Postgres process, or from a
- 		 * previous cycle of life in this same process.  Zap it, if possible.
- 		 * This probably shouldn't fail, but if it does, assume the segment
- 		 * belongs to someone else after all, and continue quietly.
- 		 */
- 		shmdt(memAddress);
- 		if (shmctl(shmid, IPC_RMID, NULL) < 0)
- 			continue;
- 
- 		/*
- 		 * Now try again to create the segment.
- 		 */
- 		memAddress = InternalIpcMemoryCreate(NextShmemSegID, size);
- 		if (memAddress)
- 			break;				/* successful create and attach */
- 
- 		/*
- 		 * Can only get here if some other process managed to create the same
- 		 * shmem key before we did.  Let him have that one, loop around to try
- 		 * next key.
- 		 */
- 	}
- 
- 	/*
- 	 * OK, we created a new segment.  Mark it as created by this process. The
- 	 * order of assignments here is critical so that another Postgres process
- 	 * can't see the header as valid but belonging to an invalid PID!
- 	 */
- 	hdr = (PGShmemHeader *) memAddress;
- 	hdr->creatorPID = getpid();
- 	hdr->magic = PGShmemMagic;
- 
- 	/* Fill in the data directory ID info, too */
- 	if (stat(DataDir, &statbuf) < 0)
- 		ereport(FATAL,
- 				(errcode_for_file_access(),
- 				 errmsg("could not stat data directory \"%s\": %m",
- 						DataDir)));
- 	hdr->device = statbuf.st_dev;
- 	hdr->inode = statbuf.st_ino;
- 
- 	/*
- 	 * Initialize space allocation status for segment.
- 	 */
- 	hdr->totalsize = size;
- 	hdr->freeoffset = MAXALIGN(sizeof(PGShmemHeader));
- 
- 	/* Save info for possible future use */
- 	UsedShmemSegAddr = memAddress;
- 	UsedShmemSegID = (unsigned long) NextShmemSegID;
- 
- 	return hdr;
- }
- 
- #ifdef EXEC_BACKEND
- 
- /*
-  * PGSharedMemoryReAttach
-  *
-  * Re-attach to an already existing shared memory segment.	In the non
-  * EXEC_BACKEND case this is not used, because postmaster children inherit
-  * the shared memory segment attachment via fork().
-  *
-  * UsedShmemSegID and UsedShmemSegAddr are implicit parameters to this
-  * routine.  The caller must have already restored them to the postmaster's
-  * values.
-  */
- void
- PGSharedMemoryReAttach(void)
- {
- 	IpcMemoryId shmid;
- 	void	   *hdr;
- 	void	   *origUsedShmemSegAddr = UsedShmemSegAddr;
- 
- 	Assert(UsedShmemSegAddr != NULL);
- 	Assert(IsUnderPostmaster);
- 
- #ifdef __CYGWIN__
- 	/* cygipc (currently) appears to not detach on exec. */
- 	PGSharedMemoryDetach();
- 	UsedShmemSegAddr = origUsedShmemSegAddr;
- #endif
- 
- 	elog(DEBUG3, "attaching to %p", UsedShmemSegAddr);
- 	hdr = (void *) PGSharedMemoryAttach((IpcMemoryKey) UsedShmemSegID, &shmid);
- 	if (hdr == NULL)
- 		elog(FATAL, "could not reattach to shared memory (key=%d, addr=%p): %m",
- 			 (int) UsedShmemSegID, UsedShmemSegAddr);
- 	if (hdr != origUsedShmemSegAddr)
- 		elog(FATAL, "reattaching to shared memory returned unexpected address (got %p, expected %p)",
- 			 hdr, origUsedShmemSegAddr);
- 
- 	UsedShmemSegAddr = hdr;		/* probably redundant */
- }
- #endif   /* EXEC_BACKEND */
- 
- /*
-  * PGSharedMemoryDetach
-  *
-  * Detach from the shared memory segment, if still attached.  This is not
-  * intended for use by the process that originally created the segment
-  * (it will have an on_shmem_exit callback registered to do that).	Rather,
-  * this is for subprocesses that have inherited an attachment and want to
-  * get rid of it.
-  */
- void
- PGSharedMemoryDetach(void)
- {
- 	if (UsedShmemSegAddr != NULL)
- 	{
- 		if ((shmdt(UsedShmemSegAddr) < 0)
- #if defined(EXEC_BACKEND) && defined(__CYGWIN__)
- 		/* Work-around for cygipc exec bug */
- 			&& shmdt(NULL) < 0
- #endif
- 			)
- 			elog(LOG, "shmdt(%p) failed: %m", UsedShmemSegAddr);
- 		UsedShmemSegAddr = NULL;
- 	}
- }
- 
- 
- /*
-  * Attach to shared memory and make sure it has a Postgres header
-  *
-  * Returns attach address if OK, else NULL
-  */
- static PGShmemHeader *
- PGSharedMemoryAttach(IpcMemoryKey key, IpcMemoryId *shmid)
- {
- 	PGShmemHeader *hdr;
- 
- 	if ((*shmid = shmget(key, sizeof(PGShmemHeader), 0)) < 0)
- 		return NULL;
- 
- 	hdr = (PGShmemHeader *) shmat(*shmid, UsedShmemSegAddr, PG_SHMAT_FLAGS);
- 
- 	if (hdr == (PGShmemHeader *) -1)
- 		return NULL;			/* failed: must be some other app's */
- 
- 	if (hdr->magic != PGShmemMagic)
- 	{
- 		shmdt((void *) hdr);
- 		return NULL;			/* segment belongs to a non-Postgres app */
- 	}
- 
- 	return hdr;
- }
--- 0 ----
*** a/src/backend/postmaster/autovacuum.c
--- b/src/backend/postmaster/autovacuum.c
***************
*** 368,373 **** StartAutoVacLauncher(void)
--- 368,376 ----
  			/* Lose the postmaster's on-exit routines */
  			on_exit_reset();
  
+ 			/* Hold on to the data directory lock until this process dies. */
+ 			AcquireDataDirLock();
+ 
  			AutoVacLauncherMain(0, NULL);
  			break;
  #endif
*** a/src/backend/postmaster/pgstat.c
--- b/src/backend/postmaster/pgstat.c
***************
*** 632,637 **** pgstat_start(void)
--- 632,640 ----
  			/* Lose the postmaster's on-exit routines */
  			on_exit_reset();
  
+ 			/* Hold on to the data directory lock for as long as we live. */
+ 			AcquireDataDirLock();
+ 			
  			/* Drop our connection to postmaster's shared memory, as well */
  			PGSharedMemoryDetach();
  
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***************
*** 484,489 **** PostmasterMain(int argc, char *argv[])
--- 484,490 ----
  	char	   *userDoption = NULL;
  	bool		listen_addr_saved = false;
  	int			i;
+ 	bool		blockOnStartupLockOption = false;
  
  	MyProcPid = PostmasterPid = getpid();
  
***************
*** 529,535 **** PostmasterMain(int argc, char *argv[])
  	 * tcop/postgres.c (the option sets should not conflict) and with the
  	 * common help() function in main/main.c.
  	 */
! 	while ((opt = getopt(argc, argv, "A:B:c:D:d:EeFf:h:ijk:lN:nOo:Pp:r:S:sTt:W:-:")) != -1)
  	{
  		switch (opt)
  		{
--- 530,536 ----
  	 * tcop/postgres.c (the option sets should not conflict) and with the
  	 * common help() function in main/main.c.
  	 */
! 	while ((opt = getopt(argc, argv, "A:bB:c:D:d:EeFf:h:ijk:lN:nOo:Pp:r:S:sTt:W:-:")) != -1)
  	{
  		switch (opt)
  		{
***************
*** 537,542 **** PostmasterMain(int argc, char *argv[])
--- 538,546 ----
  				SetConfigOption("debug_assertions", optarg, PGC_POSTMASTER, PGC_S_ARGV);
  				break;
  
+ 			case 'b':
+ 				blockOnStartupLockOption = true;
+ 				break;
  			case 'B':
  				SetConfigOption("shared_buffers", optarg, PGC_POSTMASTER, PGC_S_ARGV);
  				break;
***************
*** 790,796 **** PostmasterMain(int argc, char *argv[])
  	 * For the same reason, it's best to grab the TCP socket(s) before the
  	 * Unix socket.
  	 */
! 	CreateDataDirLockFile(true);
  
  	/*
  	 * If timezone is not set, determine what the OS uses.	(In theory this
--- 794,800 ----
  	 * For the same reason, it's best to grab the TCP socket(s) before the
  	 * Unix socket.
  	 */
! 	CreateDataDirLockFile(true, blockOnStartupLockOption);
  
  	/*
  	 * If timezone is not set, determine what the OS uses.	(In theory this
*** a/src/backend/tcop/postgres.c
--- b/src/backend/tcop/postgres.c
***************
*** 3600,3606 **** PostgresMain(int argc, char *argv[], const char *username)
  		/*
  		 * Create lockfile for data directory.
  		 */
! 		CreateDataDirLockFile(false);
  	}
  
  	/* Early initialization */
--- 3600,3606 ----
  		/*
  		 * Create lockfile for data directory.
  		 */
! 		CreateDataDirLockFile(false, false);
  	}
  
  	/* Early initialization */
***************
*** 3618,3623 **** PostgresMain(int argc, char *argv[], const char *username)
--- 3618,3633 ----
  #else
  	InitProcess();
  #endif
+ 	if (IsUnderPostmaster)
+ 	{
+ 		/* Acquire the advisory lock on the lock file to eliminate multiple-postmaster race conditions.
+ 		 * PostgreSQL backends (postmaster children) must hold the read lock to signal that backends are still operating in this data directory.
+ 		 * This must be done after InitProcess because the function needs access to the shared memory proc array.
+ 		 */
+ 		AcquireDataDirLock();
+ 	}
+ 
+ 
  
  	/* We need to allow SIGINT, etc during the initial transaction */
  	PG_SETMASK(&UnBlockSig);
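
The comment in the hunk above has each backend take a shared (read) advisory lock on the data directory lock file. This excerpt does not show the bodies of AcquireLock or GetPIDHoldingLock, so the following is only a rough sketch under the assumption that they are built on POSIX fcntl() record locks. The names try_lock_whole_file and pid_blocking_lock are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): try to take an advisory lock
 * covering the whole file, without blocking.  A shared lock is a read lock,
 * an exclusive lock is a write lock.  Returns 0 on success, -1 on failure
 * with errno set (EACCES or EAGAIN if another process holds a conflicting
 * lock).
 */
static int
try_lock_whole_file(int fd, bool exclusive)
{
	struct flock lock = {0};

	assert(fd >= 0);
	lock.l_type = exclusive ? F_WRLCK : F_RDLCK;
	lock.l_whence = SEEK_SET;
	lock.l_start = 0;
	lock.l_len = 0;				/* zero length means "to end of file" */

	return fcntl(fd, F_SETLK, &lock);
}

/*
 * Hypothetical helper (not from the patch): ask the kernel whether some
 * other process holds a lock that would conflict with the requested one.
 * Returns that process's PID, 0 if nothing conflicts, or -1 on error.
 * Locks held by the calling process itself are never reported.
 */
static pid_t
pid_blocking_lock(int fd, bool exclusive)
{
	struct flock lock = {0};

	lock.l_type = exclusive ? F_WRLCK : F_RDLCK;
	lock.l_whence = SEEK_SET;
	lock.l_start = 0;
	lock.l_len = 0;

	if (fcntl(fd, F_GETLK, &lock) < 0)
		return -1;

	return (lock.l_type == F_UNLCK) ? 0 : lock.l_pid;
}
```

Under these assumptions, a starting postmaster could call pid_blocking_lock(fd, true) on postmaster.pid: a positive result names a process that still holds a read lock, for example an orphaned backend. The kernel drops fcntl() locks when the holding process exits, even after SIGKILL, so no exit callback is needed to release them. Such locks are not inherited across fork(), which would explain why each child process has to acquire its own.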
*** a/src/backend/utils/init/miscinit.c
--- b/src/backend/utils/init/miscinit.c
***************
*** 44,57 ****
  #include "utils/memutils.h"
  #include "utils/syscache.h"
  
  
  #define DIRECTORY_LOCK_FILE		"postmaster.pid"
  
  ProcessingMode Mode = InitProcessing;
  
- /* Note: we rely on this to initialize as zeroes */
- static char socketLockFile[MAXPGPATH];
- 
  
  /* ----------------------------------------------------------------
   *		ignoring system indexes support stuff
--- 44,75 ----
  #include "utils/memutils.h"
  #include "utils/syscache.h"
  
+ /* Note: we rely on this to initialize as zeroes */
+ static char socketLockFile[MAXPGPATH];
  
  #define DIRECTORY_LOCK_FILE		"postmaster.pid"
+ static pid_t GetPIDHoldingLock(int fileDescriptor,bool exclusiveLockFlag);
+ static int AcquireLock(int fileDescriptor,bool exclusiveLockFlag,bool waitForLock);
+ static int ReleaseLock(int fileDescriptor);
+ void AcquireDataDirLock(void);
+ pid_t GetPIDHoldingDataDirLock(void);
+ static void WriteLockFileContents(char *lockFilePath,int lockFileFD,bool isPostmasterFlag,pid_t processPid,char *dataDirectoryPath,long startTime,int portNumber,char * socketDirectory);
+ 
+ /* enum used by CreateLockFile to report its success or error condition */
+ typedef enum
+   {
+     /* positive numbers refer to the PID of a conflicting lock-holding process, but not necessarily a postmaster */
+     CreateLockFileNoError=0, 
+     CreateLockFileFileError=-1, /* advises the caller to check errno for the error */
+     CreateLockFileSharedLockAcquisitionError=-2,
+     CreateLockFileExclusiveLockCheckError=-3
+   } CreateLockFileValue;
+ 
+ 
+ static CreateLockFileValue CreateLockFile(char *lockFilePath,bool amPostmaster,int *lockFileRetDescriptor,bool blockOnLockFlag);
  
  ProcessingMode Mode = InitProcessing;
  
  
  /* ----------------------------------------------------------------
   *		ignoring system indexes support stuff
***************
*** 627,635 **** GetUserNameFromId(Oid roleid)
  /*-------------------------------------------------------------------------
   *				Interlock-file support
   *
!  * These routines are used to create both a data-directory lockfile
!  * ($DATADIR/postmaster.pid) and a Unix-socket-file lockfile ($SOCKFILE.lock).
!  * Both kinds of files contain the same info:
   *
   *		Owning process' PID
   *		Data directory path
--- 645,653 ----
  /*-------------------------------------------------------------------------
   *				Interlock-file support
   *
!  * These routines are used to create a data-directory lockfile
!  * ($DATADIR/postmaster.pid).
!  * The file contains the following info:
   *
   *		Owning process' PID
   *		Data directory path
***************
*** 642,963 **** GetUserNameFromId(Oid roleid)
   * A data-directory lockfile can optionally contain a third line, containing
   * the key and ID for the shared memory block used by this postmaster.
   *
-  * On successful lockfile creation, a proc_exit callback to remove the
-  * lockfile is automatically created.
   *-------------------------------------------------------------------------
   */
  
! /*
!  * proc_exit callback to remove a lockfile.
!  */
! static void
! UnlinkLockFile(int status, Datum filename)
! {
! 	char	   *fname = (char *) DatumGetPointer(filename);
! 
! 	if (fname != NULL)
! 	{
! 		if (unlink(fname) != 0)
! 		{
! 			/* Should we complain if the unlink fails? */
! 		}
! 		free(fname);
! 	}
! }
  
  /*
!  * Create a lockfile.
   *
!  * filename is the name of the lockfile to create.
!  * amPostmaster is used to determine how to encode the output PID.
!  * isDDLock and refName are used to determine what error message to produce.
   */
! static void
! CreateLockFile(const char *filename, bool amPostmaster,
! 			   bool isDDLock, const char *refName)
  {
! 	int			fd;
! 	char		buffer[MAXPGPATH * 2 + 256];
! 	int			ntries;
! 	int			len;
! 	int			encoded_pid;
! 	pid_t		other_pid;
! 	pid_t		my_pid,
! 				my_p_pid,
! 				my_gp_pid;
! 	const char *envvar;
! 
! 	/*
! 	 * If the PID in the lockfile is our own PID or our parent's or
! 	 * grandparent's PID, then the file must be stale (probably left over from
! 	 * a previous system boot cycle).  We need to check this because of the
! 	 * likelihood that a reboot will assign exactly the same PID as we had in
! 	 * the previous reboot, or one that's only one or two counts larger and
! 	 * hence the lockfile's PID now refers to an ancestor shell process.  We
! 	 * allow pg_ctl to pass down its parent shell PID (our grandparent PID)
! 	 * via the environment variable PG_GRANDPARENT_PID; this is so that
! 	 * launching the postmaster via pg_ctl can be just as reliable as
! 	 * launching it directly.  There is no provision for detecting
! 	 * further-removed ancestor processes, but if the init script is written
! 	 * carefully then all but the immediate parent shell will be root-owned
! 	 * processes and so the kill test will fail with EPERM.  Note that we
! 	 * cannot get a false negative this way, because an existing postmaster
! 	 * would surely never launch a competing postmaster or pg_ctl process
! 	 * directly.
! 	 */
! 	my_pid = getpid();
! 
! #ifndef WIN32
! 	my_p_pid = getppid();
! #else
! 
! 	/*
! 	 * Windows hasn't got getppid(), but doesn't need it since it's not using
! 	 * real kill() either...
! 	 */
! 	my_p_pid = 0;
! #endif
! 
! 	envvar = getenv("PG_GRANDPARENT_PID");
! 	if (envvar)
! 		my_gp_pid = atoi(envvar);
! 	else
! 		my_gp_pid = 0;
! 
! 	/*
! 	 * We need a loop here because of race conditions.	But don't loop forever
! 	 * (for example, a non-writable $PGDATA directory might cause a failure
! 	 * that won't go away).  100 tries seems like plenty.
! 	 */
! 	for (ntries = 0;; ntries++)
! 	{
! 		/*
! 		 * Try to create the lock file --- O_EXCL makes this atomic.
! 		 *
! 		 * Think not to make the file protection weaker than 0600.	See
! 		 * comments below.
! 		 */
! 		fd = open(filename, O_RDWR | O_CREAT | O_EXCL, 0600);
! 		if (fd >= 0)
! 			break;				/* Success; exit the retry loop */
! 
! 		/*
! 		 * Couldn't create the pid file. Probably it already exists.
! 		 */
! 		if ((errno != EEXIST && errno != EACCES) || ntries > 100)
! 			ereport(FATAL,
! 					(errcode_for_file_access(),
! 					 errmsg("could not create lock file \"%s\": %m",
! 							filename)));
! 
! 		/*
! 		 * Read the file to get the old owner's PID.  Note race condition
! 		 * here: file might have been deleted since we tried to create it.
! 		 */
! 		fd = open(filename, O_RDONLY, 0600);
! 		if (fd < 0)
! 		{
! 			if (errno == ENOENT)
! 				continue;		/* race condition; try again */
! 			ereport(FATAL,
! 					(errcode_for_file_access(),
! 					 errmsg("could not open lock file \"%s\": %m",
! 							filename)));
! 		}
! 		if ((len = read(fd, buffer, sizeof(buffer) - 1)) < 0)
! 			ereport(FATAL,
! 					(errcode_for_file_access(),
! 					 errmsg("could not read lock file \"%s\": %m",
! 							filename)));
! 		close(fd);
! 
! 		buffer[len] = '\0';
! 		encoded_pid = atoi(buffer);
! 
! 		/* if pid < 0, the pid is for postgres, not postmaster */
! 		other_pid = (pid_t) (encoded_pid < 0 ? -encoded_pid : encoded_pid);
! 
! 		if (other_pid <= 0)
! 			elog(FATAL, "bogus data in lock file \"%s\": \"%s\"",
! 				 filename, buffer);
! 
! 		/*
! 		 * Check to see if the other process still exists
! 		 *
! 		 * Per discussion above, my_pid, my_p_pid, and my_gp_pid can be
! 		 * ignored as false matches.
! 		 *
! 		 * Normally kill() will fail with ESRCH if the given PID doesn't
! 		 * exist.
! 		 *
! 		 * We can treat the EPERM-error case as okay because that error
! 		 * implies that the existing process has a different userid than we
! 		 * do, which means it cannot be a competing postmaster.  A postmaster
! 		 * cannot successfully attach to a data directory owned by a userid
! 		 * other than its own.	(This is now checked directly in
! 		 * checkDataDir(), but has been true for a long time because of the
! 		 * restriction that the data directory isn't group- or
! 		 * world-accessible.)  Also, since we create the lockfiles mode 600,
! 		 * we'd have failed above if the lockfile belonged to another userid
! 		 * --- which means that whatever process kill() is reporting about
! 		 * isn't the one that made the lockfile.  (NOTE: this last
! 		 * consideration is the only one that keeps us from blowing away a
! 		 * Unix socket file belonging to an instance of Postgres being run by
! 		 * someone else, at least on machines where /tmp hasn't got a
! 		 * stickybit.)
! 		 */
! 		if (other_pid != my_pid && other_pid != my_p_pid &&
! 			other_pid != my_gp_pid)
! 		{
! 			if (kill(other_pid, 0) == 0 ||
! 				(errno != ESRCH && errno != EPERM))
! 			{
! 				/* lockfile belongs to a live process */
! 				ereport(FATAL,
! 						(errcode(ERRCODE_LOCK_FILE_EXISTS),
! 						 errmsg("lock file \"%s\" already exists",
! 								filename),
! 						 isDDLock ?
! 						 (encoded_pid < 0 ?
! 						  errhint("Is another postgres (PID %d) running in data directory \"%s\"?",
! 								  (int) other_pid, refName) :
! 						  errhint("Is another postmaster (PID %d) running in data directory \"%s\"?",
! 								  (int) other_pid, refName)) :
! 						 (encoded_pid < 0 ?
! 						  errhint("Is another postgres (PID %d) using socket file \"%s\"?",
! 								  (int) other_pid, refName) :
! 						  errhint("Is another postmaster (PID %d) using socket file \"%s\"?",
! 								  (int) other_pid, refName))));
! 			}
! 		}
! 
! 		/*
! 		 * No, the creating process did not exist.	However, it could be that
! 		 * the postmaster crashed (or more likely was kill -9'd by a clueless
! 		 * admin) but has left orphan backends behind.	Check for this by
! 		 * looking to see if there is an associated shmem segment that is
! 		 * still in use.
! 		 *
! 		 * Note: because postmaster.pid is written in multiple steps, we might
! 		 * not find the shmem ID values in it; we can't treat that as an
! 		 * error.
! 		 */
! 		if (isDDLock)
! 		{
! 			char	   *ptr = buffer;
! 			unsigned long id1,
! 						id2;
! 			int			lineno;
! 
! 			for (lineno = 1; lineno < LOCK_FILE_LINE_SHMEM_KEY; lineno++)
! 			{
! 				if ((ptr = strchr(ptr, '\n')) == NULL)
! 					break;
! 				ptr++;
! 			}
! 
! 			if (ptr != NULL &&
! 				sscanf(ptr, "%lu %lu", &id1, &id2) == 2)
! 			{
! 				if (PGSharedMemoryIsInUse(id1, id2))
! 					ereport(FATAL,
! 							(errcode(ERRCODE_LOCK_FILE_EXISTS),
! 							 errmsg("pre-existing shared memory block "
! 									"(key %lu, ID %lu) is still in use",
! 									id1, id2),
! 							 errhint("If you're sure there are no old "
! 								"server processes still running, remove "
! 									 "the shared memory block "
! 									 "or just delete the file \"%s\".",
! 									 filename)));
! 			}
! 		}
! 
! 		/*
! 		 * Looks like nobody's home.  Unlink the file and try again to create
! 		 * it.	Need a loop because of possible race condition against other
! 		 * would-be creators.
! 		 */
! 		if (unlink(filename) < 0)
! 			ereport(FATAL,
! 					(errcode_for_file_access(),
! 					 errmsg("could not remove old lock file \"%s\": %m",
! 							filename),
! 					 errhint("The file seems accidentally left over, but "
! 						   "it could not be removed. Please remove the file "
! 							 "by hand and try again.")));
! 	}
  
! 	/*
! 	 * Successfully created the file, now fill it.  See comment in miscadmin.h
! 	 * about the contents.  Note that we write the same info into both datadir
! 	 * and socket lockfiles; although more stuff may get added to the datadir
! 	 * lockfile later.
! 	 */
! 	snprintf(buffer, sizeof(buffer), "%d\n%s\n%ld\n%d\n%s\n",
! 			 amPostmaster ? (int) my_pid : -((int) my_pid),
! 			 DataDir,
! 			 (long) MyStartTime,
! 			 PostPortNumber,
  #ifdef HAVE_UNIX_SOCKETS
! 			 (*UnixSocketDir != '\0') ? UnixSocketDir : DEFAULT_PGSOCKET_DIR
  #else
! 			 ""
  #endif
! 			 );
  
! 	errno = 0;
! 	if (write(fd, buffer, strlen(buffer)) != strlen(buffer))
! 	{
! 		int			save_errno = errno;
! 
! 		close(fd);
! 		unlink(filename);
! 		/* if write didn't set errno, assume problem is no disk space */
! 		errno = save_errno ? save_errno : ENOSPC;
! 		ereport(FATAL,
! 				(errcode_for_file_access(),
! 				 errmsg("could not write lock file \"%s\": %m", filename)));
! 	}
! 	if (pg_fsync(fd) != 0)
! 	{
! 		int			save_errno = errno;
! 
! 		close(fd);
! 		unlink(filename);
! 		errno = save_errno;
! 		ereport(FATAL,
! 				(errcode_for_file_access(),
! 				 errmsg("could not write lock file \"%s\": %m", filename)));
! 	}
! 	if (close(fd) != 0)
! 	{
! 		int			save_errno = errno;
  
! 		unlink(filename);
! 		errno = save_errno;
! 		ereport(FATAL,
! 				(errcode_for_file_access(),
! 				 errmsg("could not write lock file \"%s\": %m", filename)));
! 	}
  
! 	/*
! 	 * Arrange for automatic removal of lockfile at proc_exit.
! 	 */
! 	on_proc_exit(UnlinkLockFile, PointerGetDatum(strdup(filename)));
  }
  
  /*
!  * Create the data directory lockfile.
!  *
!  * When this is called, we must have already switched the working
!  * directory to DataDir, so we can just use a relative path.  This
!  * helps ensure that we are locking the directory we should be.
   */
! void
! CreateDataDirLockFile(bool amPostmaster)
  {
! 	CreateLockFile(DIRECTORY_LOCK_FILE, amPostmaster, true, DataDir);
  }
  
  /*
--- 660,888 ----
   * A data-directory lockfile can optionally contain a third line, containing
   * the key and ID for the shared memory block used by this postmaster.
   *
   *-------------------------------------------------------------------------
   */
  
! /* We keep the lock file descriptor open for the life of the process so that the advisory locks stay held. */
! static int DataDirLockFileFD = 0;
  
  /*
!  * Create the data directory lockfile.
   *
!  * When this is called, we must have already switched the working
!  * directory to DataDir, so we can just use a relative path.  This
!  * helps ensure that we are locking the directory we should be.
   */
! void
! CreateDataDirLockFile(bool amPostmaster,bool blockOptionFlag)
  {
!   char *lockFilePath=DIRECTORY_LOCK_FILE;
!   CreateLockFileValue error = CreateLockFile(lockFilePath,amPostmaster,&DataDirLockFileFD,blockOptionFlag);
!   if(error==CreateLockFileNoError)
!     {
!       return;
!     }
!   else if(error==CreateLockFileFileError)
!     {
!       ereport(FATAL,
!               (errmsg("failed operation on lock file at \"%s\": %m",lockFilePath)));
!     }
!   else if(error==CreateLockFileSharedLockAcquisitionError)
!     {
!       ereport(FATAL,
!               (errmsg("failed to acquire shared lock on file \"%s\": %m",lockFilePath)));
!     }
!   else if(error==CreateLockFileExclusiveLockCheckError)
!     {
!       ereport(FATAL,
!               (errmsg("failed to check for exclusive lock on file \"%s\": %m", lockFilePath)));
! 
!     }
!   else if(error>0)
!     {
!       /* error holds the pid of the conflicting process */
!       ereport(FATAL,
!               (errmsg("another postgresql process is running in the data directory \"%s\" with pid %d",DataDir,error),
!                errhint("kill the other server processes to start a new postgresql server")));
!     }
!   else
!     {
!       ereport(FATAL,
!               (errmsg("an unhandled locking error occurred")));
!     }
! }
  
! static CreateLockFileValue
! CreateLockFile(char *lockFilePath,bool amPostmaster,int *lockFileRetDescriptor,bool blockOnLockFlag)
! {
!   int success = 0;
!   pid_t pidHoldingLock;
!   int lockFileFD = 0;
! 
!   /* open the directory lock file- whether or not it already exists is irrelevant because we will check for file locks */
!   lockFileFD = open(lockFilePath, O_RDWR | O_CREAT, 0600);
!   if(lockFileFD<0)
!     {
!       return CreateLockFileFileError;
!     }
! 
!   if(!blockOnLockFlag)
!     {
!       /* Acquire the shared advisory lock without blocking*/
!       
!       success = AcquireLock(lockFileFD,false,false); 
!       if(success < 0)
!         {
!           /* We failed to acquire the read lock, which is unlikely because no one should be holding an exclusive lock on it. */
!           return CreateLockFileSharedLockAcquisitionError;
!         }
!     }
!   else 
!     {
!       /* loop until we get the exclusive lock and the subsequent shared lock- waiting until the data directory is not being serviced is data directory postgresql hot standby mode*/
!       while(1)
!         {
!           success = AcquireLock(lockFileFD,true,true);
!           if(success < 0)
!             return CreateLockFileExclusiveLockCheckError;
!           
!           /* now we hold the exclusive lock, so demote to read lock, but be wary of the race condition whereby a different postmaster could also be waiting to grab the read lock too */
!           success = ReleaseLock(lockFileFD);
!           if(success < 0)
!             return CreateLockFileExclusiveLockCheckError;
!           
!           success = AcquireLock(lockFileFD,false,false);
!           if(success < 0)
!             {
!               /* d'oh- some other postmaster grabbed the exclusive lock in the meantime, so try again later */
!               pg_usleep(500L);
!               continue;
!             }
!           else
!             break;
!         }
!     }
! 
!   
!   /* Determine if acquiring an exclusive (write) lock would be denied. If so, there is another postmaster or postgres child process running, so abort. */
!   pidHoldingLock = GetPIDHoldingLock(lockFileFD,true);
!   if(pidHoldingLock < 0)
!     {
!       /* checking for a lock failed */
!       return CreateLockFileExclusiveLockCheckError;
!     }
!   else if(pidHoldingLock > 0)
!     {
!       /* there is another process holding the lock, so we must abort starting a new postmaster */
!       return pidHoldingLock;
!     }
!   /*no process would block the lock, so we are cleared for starting a new postmaster*/
!   WriteLockFileContents(lockFilePath,lockFileFD,
!                         true,
!                         getpid(),
!                         DataDir,
!                         (long)MyStartTime,
!                         PostPortNumber,
  #ifdef HAVE_UNIX_SOCKETS
!                         (*UnixSocketDir != '\0') ? UnixSocketDir : DEFAULT_PGSOCKET_DIR
  #else
!                          ""
  #endif
!                         );
!   /* There is no need to remove the lock file because the locks synchronize access, not the existence of the file. */
  
!   if(lockFileRetDescriptor != NULL)
!     *lockFileRetDescriptor = lockFileFD;
!   return CreateLockFileNoError;
! }
  
! static void WriteLockFileContents(char *lockFilePath,int lockFileFD,bool isPostmasterFlag,pid_t processPid,char *dataDirectoryPath,long startTime,int portNumber,char * socketDirectoryPath)
! {
!   char writeBuffer[MAXPGPATH * 2 + 256];  
!   snprintf(writeBuffer,sizeof(writeBuffer),"%d\n%s\n%ld\n%d\n%s\n",
!            isPostmasterFlag ? (int) processPid : -((int) processPid),
!            dataDirectoryPath,
!            startTime,
!            portNumber,
!            socketDirectoryPath
!            );
!   errno = 0;
!   if (write(lockFileFD, writeBuffer, strlen(writeBuffer)) != strlen(writeBuffer))
!     {
!       int                     save_errno = errno;
!       unlink(lockFilePath);
!       /* if write didn't set errno, assume problem is no disk space */
!       errno = save_errno ? save_errno : ENOSPC;
!       ereport(FATAL,
!               (errcode_for_file_access(),
!                errmsg("could not write lock file \"%s\": %m", lockFilePath)));
!     }
!   if (pg_fsync(lockFileFD) != 0)
!     {
!       int                     save_errno = errno;
! 
!       unlink(lockFilePath);
!       errno = save_errno;
!       ereport(FATAL,
!               (errcode_for_file_access(),
!                errmsg("could not write lock file \"%s\": %m", lockFilePath)));
!     }
!   return;
! }
  
! /* Called by pg_ctl to determine when the postmaster is shutdown. */
! pid_t GetPIDHoldingDataDirLock(void)
! {
! 	return GetPIDHoldingLock(DataDirLockFileFD,true);
  }
  
+ 
  /*
!  * Called by backends when they startup to signify that the data directory is in use
   */
! void AcquireDataDirLock(void)
  {
!   int success;
!   int exclusiveLockViolatingPID = 0;
!   /* get the read lock */
!   success = AcquireLock(DataDirLockFileFD,false,false); 
!   if(success < 0)
!     {
!       /* Failed to acquire read lock, bomb out */
!       ereport(FATAL,
!               (errmsg("failed to acquire lock on \"%s\": %m",DIRECTORY_LOCK_FILE)));
! 
!     }
!   /* verify that grabbing an exclusive lock would complain that the parent or PROC_ARRAY sibling process would cause exclusive lock acquisition to fail- otherwise a separate postmaster is holding the lock (eliminates a possible race condition when the postmaster spawns a backend, immediately dies and new postmaster takes over) */
!   exclusiveLockViolatingPID = GetPIDHoldingLock(DataDirLockFileFD,true);
!   if(exclusiveLockViolatingPID < 0)
!     {
!       /* error testing for the lock, very unlikely, but fatal */
!       ereport(FATAL,
!               (errmsg("failed to test for lock on \"%s\": %m",DIRECTORY_LOCK_FILE)));
!     }
!   else if(exclusiveLockViolatingPID == 0)
!     {
!       /* the postmaster should be holding the lock- in this case it is not (and this is the only backend running), so don't bother running the backend because the postmaster just died */
!       ereport(FATAL,
!               (errmsg("failed to initialize backend because the postmaster exited")));
!     }
!   else
!     {
!       /* the PID is valid, so we should check that the PID refers either to the postmaster or its children */
!       /* NOTE TO REVIEWER: is this too early to call BackendPidGetProc? */ 
!       PGPROC *violatingProc = NULL;
!       violatingProc = BackendPidGetProc(exclusiveLockViolatingPID);
! 
!       if(exclusiveLockViolatingPID != getppid() && 
!          violatingProc == NULL)
!         {
!           /* the violating lock is neither the postmaster nor a sibling child- data directory conflict detected! */
!           ereport(FATAL,
!                   (errmsg("backend startup race condition detected- another postmaster is running in this data directory")));
!         }
!     }
!   return;
  }
  
  /*
***************
*** 966,977 **** CreateDataDirLockFile(bool amPostmaster)
  void
  CreateSocketLockFile(const char *socketfile, bool amPostmaster)
  {
! 	char		lockfile[MAXPGPATH];
! 
! 	snprintf(lockfile, sizeof(lockfile), "%s.lock", socketfile);
! 	CreateLockFile(lockfile, amPostmaster, false, socketfile);
! 	/* Save name of lockfile for TouchSocketLockFile */
! 	strcpy(socketLockFile, lockfile);
  }
  
  /*
--- 891,939 ----
  void
  CreateSocketLockFile(const char *socketfile, bool amPostmaster)
  {
!   char lockFilePath[MAXPGPATH];
!   CreateLockFileValue error;
! 
!   snprintf(lockFilePath, sizeof(lockFilePath), "%s.lock", socketfile);
! 
!   /* This intentionally leaks the socket file descriptor- we hold onto it so that the lock is held until the process is exited */
!   error = CreateLockFile(lockFilePath, amPostmaster,NULL,false);
!   
!   if(error==CreateLockFileNoError)
!     {
!       return;
!     }
!   else if(error==CreateLockFileFileError)
!     {
!       ereport(FATAL,
!               (errmsg("failed operation on lock file at \"%s\": %m",lockFilePath)));
!     }
!   else if(error==CreateLockFileSharedLockAcquisitionError)
!     {
!       ereport(FATAL,
!               (errmsg("failed to acquire shared lock on file \"%s\": %m",lockFilePath)));
!     }
!   else if(error==CreateLockFileExclusiveLockCheckError)
!     {
!       ereport(FATAL,
!               (errmsg("failed to check for exclusive lock on file \"%s\": %m", lockFilePath)));
!       
!     }
!   else if(error>0)
!     {
!       /* error holds the pid of the conflicting process */
!       ereport(FATAL,
!               (errmsg("another postgresql process with pid %d is bound to the socket file at \"%s\"",error,lockFilePath),
!                errhint("configure a different socket file path in postgresql.conf or kill the conflicting postgresql server")));
!     }
!   else
!     {
!       ereport(FATAL,
!               (errmsg("an unhandled locking error occurred")));
!     }
!   
!   /* Save name of lockfile for TouchSocketLockFile */
!   strcpy(socketLockFile, lockFilePath);
  }
  
  /*
***************
*** 985,1020 **** CreateSocketLockFile(const char *socketfile, bool amPostmaster)
  void
  TouchSocketLockFile(void)
  {
! 	/* Do nothing if we did not create a socket... */
! 	if (socketLockFile[0] != '\0')
! 	{
! 		/*
! 		 * utime() is POSIX standard, utimes() is a common alternative; if we
! 		 * have neither, fall back to actually reading the file (which only
! 		 * sets the access time not mod time, but that should be enough in
! 		 * most cases).  In all paths, we ignore errors.
! 		 */
  #ifdef HAVE_UTIME
! 		utime(socketLockFile, NULL);
! #else							/* !HAVE_UTIME */
  #ifdef HAVE_UTIMES
! 		utimes(socketLockFile, NULL);
! #else							/* !HAVE_UTIMES */
! 		int			fd;
! 		char		buffer[1];
! 
! 		fd = open(socketLockFile, O_RDONLY | PG_BINARY, 0);
! 		if (fd >= 0)
! 		{
! 			read(fd, buffer, sizeof(buffer));
! 			close(fd);
! 		}
  #endif   /* HAVE_UTIMES */
  #endif   /* HAVE_UTIME */
! 	}
  }
  
- 
  /*
   * Add (or replace) a line in the data directory lock file.
   * The given string should not include a trailing newline.
--- 947,980 ----
  void
  TouchSocketLockFile(void)
  {
!   /* Do nothing if we did not create a socket... */
!   if (socketLockFile[0] != '\0')
!     {
!       /*
!        * utime() is POSIX standard, utimes() is a common alternative; if we
!        * have neither, fall back to actually reading the file (which only
!        * sets the access time not mod time, but that should be enough in
!        * most cases).  In all paths, we ignore errors.
!        */
  #ifdef HAVE_UTIME
!       utime(socketLockFile, NULL);
! #else   /* !HAVE_UTIME */
  #ifdef HAVE_UTIMES
!       utimes(socketLockFile, NULL);
! #else   /* !HAVE_UTIMES */
!       int fd;
!       char buffer[1];
! 
!       fd = open(socketLockFile, O_RDONLY | PG_BINARY, 0);
!       if (fd >= 0)
!         {
!           read(fd, buffer, sizeof(buffer));
!         }
  #endif   /* HAVE_UTIMES */
  #endif   /* HAVE_UTIME */
!     }
  }
  
  /*
   * Add (or replace) a line in the data directory lock file.
   * The given string should not include a trailing newline.
***************
*** 1030,1043 **** AddToDataDirLockFile(int target_line, const char *str)
  	int			lineno;
  	char	   *ptr;
  	char		buffer[BLCKSZ];
  
! 	fd = open(DIRECTORY_LOCK_FILE, O_RDWR | PG_BINARY, 0);
! 	if (fd < 0)
  	{
  		ereport(LOG,
  				(errcode_for_file_access(),
! 				 errmsg("could not open file \"%s\": %m",
! 						DIRECTORY_LOCK_FILE)));
  		return;
  	}
  	len = read(fd, buffer, sizeof(buffer) - 1);
--- 990,1006 ----
  	int			lineno;
  	char	   *ptr;
  	char		buffer[BLCKSZ];
+         int success;
  
! 	fd = DataDirLockFileFD;
!         /* rewind the file handle to rewrite it */
!         success = lseek(fd,0,SEEK_SET);
! 	if (success < 0)
  	{
  		ereport(LOG,
  				(errcode_for_file_access(),
! 				 errmsg("could not seek lock file \"%s\": %m",
!                                         DIRECTORY_LOCK_FILE)));
  		return;
  	}
  	len = read(fd, buffer, sizeof(buffer) - 1);
***************
*** 1047,1053 **** AddToDataDirLockFile(int target_line, const char *str)
  				(errcode_for_file_access(),
  				 errmsg("could not read from file \"%s\": %m",
  						DIRECTORY_LOCK_FILE)));
- 		close(fd);
  		return;
  	}
  	buffer[len] = '\0';
--- 1010,1015 ----
***************
*** 1061,1067 **** AddToDataDirLockFile(int target_line, const char *str)
  		if ((ptr = strchr(ptr, '\n')) == NULL)
  		{
  			elog(LOG, "bogus data in \"%s\"", DIRECTORY_LOCK_FILE);
- 			close(fd);
  			return;
  		}
  		ptr++;
--- 1023,1028 ----
***************
*** 1088,1094 **** AddToDataDirLockFile(int target_line, const char *str)
  				(errcode_for_file_access(),
  				 errmsg("could not write to file \"%s\": %m",
  						DIRECTORY_LOCK_FILE)));
- 		close(fd);
  		return;
  	}
  	if (pg_fsync(fd) != 0)
--- 1049,1054 ----
***************
*** 1098,1110 **** AddToDataDirLockFile(int target_line, const char *str)
  				 errmsg("could not write to file \"%s\": %m",
  						DIRECTORY_LOCK_FILE)));
  	}
- 	if (close(fd) != 0)
- 	{
- 		ereport(LOG,
- 				(errcode_for_file_access(),
- 				 errmsg("could not write to file \"%s\": %m",
- 						DIRECTORY_LOCK_FILE)));
- 	}
  }
  
  
--- 1058,1063 ----
***************
*** 1300,1302 **** pg_bindtextdomain(const char *domain)
--- 1253,1300 ----
  	}
  #endif
  }
+ 
+ /* We can also offer the option to block until the other postmaster is cleared away using F_SETLKW */
+ static int AcquireLock(int fileDescriptor,bool exclusiveLockFlag,bool waitForLock)
+ {
+   struct flock lock = { 
+     .l_type = exclusiveLockFlag ? F_WRLCK : F_RDLCK,
+     .l_start = 0,
+     .l_whence = SEEK_SET,
+     .l_len = 100 
+   };
+   return fcntl(fileDescriptor , waitForLock ? F_SETLKW : F_SETLK, &lock);
+ }
+ 
+ static int ReleaseLock(int fileDescriptor)
+ {
+   struct flock lock = {
+     .l_type = F_UNLCK,
+     .l_start = 0,
+     .l_whence = SEEK_SET,
+     .l_len = 100
+   };
+   return fcntl(fileDescriptor, F_SETLK, &lock);
+ }
+ 
+ static pid_t GetPIDHoldingLock(int fileDescriptor,bool exclusiveLockFlag)
+ {
+   struct flock lock = {
+     .l_type = exclusiveLockFlag ? F_WRLCK : F_RDLCK,
+     .l_start = 0,
+     .l_whence = SEEK_SET,
+     .l_len = 100,
+     .l_pid = 0
+   };
+   int success;
+ 
+   success = fcntl(fileDescriptor,F_GETLK,&lock);
+   if(success < 0)
+     {
+       return (pid_t)success;
+     }
+   if(lock.l_whence == SEEK_SET)
+     return lock.l_pid;
+   else
+     return -1;
+ }
*** a/src/bin/pg_ctl/pg_ctl.c
--- b/src/bin/pg_ctl/pg_ctl.c
***************
*** 26,31 ****
--- 26,32 ----
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <unistd.h>
+ #include <fcntl.h>
  
  #ifdef HAVE_SYS_RESOURCE_H
  #include <sys/time.h>
***************
*** 271,297 **** get_pgpid(void)
  {
  	FILE	   *pidf;
  	long		pid;
  
  	pidf = fopen(pid_file, "r");
! 	if (pidf == NULL)
! 	{
! 		/* No pid file, not an error on startup */
! 		if (errno == ENOENT)
! 			return 0;
! 		else
! 		{
! 			write_stderr(_("%s: could not open PID file \"%s\": %s\n"),
! 						 progname, pid_file, strerror(errno));
! 			exit(1);
! 		}
! 	}
! 	if (fscanf(pidf, "%ld", &pid) != 1)
! 	{
! 		write_stderr(_("%s: invalid data in PID file \"%s\"\n"),
! 					 progname, pid_file);
! 		exit(1);
! 	}
  	fclose(pidf);
  	return (pgpid_t) pid;
  }
  
--- 272,320 ----
  {
  	FILE	   *pidf;
  	long		pid;
+         struct flock lock = {
+           .l_type =  F_WRLCK,
+           .l_start = 0,
+           .l_whence = SEEK_SET,
+           .l_len = 100,
+           .l_pid = 0
+         };
+         int success;
  
  	pidf = fopen(pid_file, "r");
! 
!         /* Attempt to acquire an exclusive lock. If that fails, we know that an existing backend is still holding a read lock. See src/backend/utils/init/miscinit.c for more details. */
!         if(pidf == NULL)
!           {
!             if(errno == ENOENT)
!               {
!                 /* No lock file found. */
!                 return 0;
!               }
!             else 
!               {
!                 write_stderr(_("%s: could not open PID file \"%s\": %s\n"),
!                              progname,pid_file,strerror(errno));
!                 exit(1);
!               }
!           }
!         success = fcntl(fileno(pidf),F_GETLK,&lock);
  	fclose(pidf);
+         if(success < 0)
+           {
+             /* Failed syscall */
+             write_stderr(_("%s: failed to test lock status: %s"),progname,strerror(errno));
+             exit(1);
+           }
+         if(lock.l_whence == SEEK_SET)
+           /* There is a pid holding a lock. */
+           return lock.l_pid;
+         else if(lock.l_type == F_UNLCK)
+           /* No lock would block exclusive access. */
+           return 0;
+         else
+           return -1;
+ 
  	return (pgpid_t) pid;
  }
  
*** a/src/include/miscadmin.h
--- b/src/include/miscadmin.h
***************
*** 373,379 **** extern char *local_preload_libraries_string;
  #define LOCK_FILE_LINE_LISTEN_ADDR	6
  #define LOCK_FILE_LINE_SHMEM_KEY	7
  
! extern void CreateDataDirLockFile(bool amPostmaster);
  extern void CreateSocketLockFile(const char *socketfile, bool amPostmaster);
  extern void TouchSocketLockFile(void);
  extern void AddToDataDirLockFile(int target_line, const char *str);
--- 373,381 ----
  #define LOCK_FILE_LINE_LISTEN_ADDR	6
  #define LOCK_FILE_LINE_SHMEM_KEY	7
  
! extern void CreateDataDirLockFile(bool amPostmaster,bool blockOptionFlag);
! extern void AcquireDataDirLock(void);
! extern pid_t GetPIDHoldingDataDirLock(void);
  extern void CreateSocketLockFile(const char *socketfile, bool amPostmaster);
  extern void TouchSocketLockFile(void);
  extern void AddToDataDirLockFile(int target_line, const char *str);

Robert Haas

robertmhaas@gmail.com

almost 15 years ago

In reply to: A.M. (#6)

Re: POSIX shared memory redux

On Sun, Apr 10, 2011 at 5:03 PM, A.M. <agentm@themactionfaction.com> wrote:

To ensure that no two postmasters can startup in the same data directory, I use fcntl range locking on the data directory lock file, which also works properly on (properly configured) NFS volumes. Whenever a postmaster or postmaster child starts, it acquires a read (non-exclusive) lock on the data directory's lock file. When a new postmaster starts, it queries if anything would block a write (exclusive) lock on the lock file which returns a lock-holding PID in the case when other postgresql processes are running.

This seems a lot leakier than what we do now (imagine, for example,
shared storage) and I'm not sure what the advantage is. I was
imagining keeping some portion of the data in sysv shm, and moving the
big stuff to a POSIX shm that would operate alongside it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

A.M.

agentm@themactionfaction.com

almost 15 years ago

In reply to: Robert Haas (#7)

Re: POSIX shared memory redux

On Apr 11, 2011, at 6:06 PM, Robert Haas wrote:

On Sun, Apr 10, 2011 at 5:03 PM, A.M. <agentm@themactionfaction.com> wrote:

To ensure that no two postmasters can startup in the same data directory, I use fcntl range locking on the data directory lock file, which also works properly on (properly configured) NFS volumes. Whenever a postmaster or postmaster child starts, it acquires a read (non-exclusive) lock on the data directory's lock file. When a new postmaster starts, it queries if anything would block a write (exclusive) lock on the lock file which returns a lock-holding PID in the case when other postgresql processes are running.

This seems a lot leakier than what we do now (imagine, for example,
shared storage) and I'm not sure what the advantage is. I was
imagining keeping some portion of the data in sysv shm, and moving the
big stuff to a POSIX shm that would operate alongside it.

What do you mean by "leakier"? The goal here is to extinguish SysV shared memory for portability and convenience benefits. The mini-SysV proposal was implemented and shot down by Tom Lane.

Cheers,
M

Robert Haas

robertmhaas@gmail.com

almost 15 years ago

In reply to: A.M. (#8)

Re: POSIX shared memory redux

On Mon, Apr 11, 2011 at 3:11 PM, A.M. <agentm@themactionfaction.com> wrote:

On Apr 11, 2011, at 6:06 PM, Robert Haas wrote:

On Sun, Apr 10, 2011 at 5:03 PM, A.M. <agentm@themactionfaction.com> wrote:

To ensure that no two postmasters can startup in the same data directory, I use fcntl range locking on the data directory lock file, which also works properly on (properly configured) NFS volumes. Whenever a postmaster or postmaster child starts, it acquires a read (non-exclusive) lock on the data directory's lock file. When a new postmaster starts, it queries if anything would block a write (exclusive) lock on the lock file which returns a lock-holding PID in the case when other postgresql processes are running.

This seems a lot leakier than what we do now (imagine, for example,
shared storage) and I'm not sure what the advantage is. I was
imagining keeping some portion of the data in sysv shm, and moving the
big stuff to a POSIX shm that would operate alongside it.

What do you mean by "leakier"? The goal here is to extinguish SysV shared memory for portability and convenience benefits. The mini-SysV proposal was implemented and shot down by Tom Lane.

I mean I'm not convinced that fcntl() locking will be as reliable.

I know Tom shot that down before, but I still think it's probably the
best way forward. The advantage I see is that we would be able to
more easily allocate larger chunks of shared memory without changing
kernel parameters, and perhaps even to dynamically resize shared
memory chunks. That'd be worth the price of admission even if we
didn't get all those benefits in one commit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10

Tom Lane

tgl@sss.pgh.pa.us

almost 15 years ago

In reply to: Robert Haas (#9)

Re: POSIX shared memory redux

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Apr 11, 2011 at 3:11 PM, A.M. <agentm@themactionfaction.com> wrote:

What do you mean by "leakier"? The goal here is to extinguish SysV shared memory for portability and convenience benefits. The mini-SysV proposal was implemented and shot down by Tom Lane.

I mean I'm not convinced that fcntl() locking will be as reliable.

I'm not either. Particularly not on NFS. (Although on NFS you have
other issues to worry about too, like postmasters on different machines
being able to reach the same data directory. I wonder if we should do
both SysV and fcntl locking ...)

I know Tom shot that down before, but I still think it's probably the
best way forward.

Did I? I think I pointed out that there's zero gain in portability as
long as we still depend on SysV shmem to work. However, if you're doing
it for other reasons than portability, it might make sense anyway. The
question is whether there are adequate other reasons.

The advantage I see is that we would be able to
more easily allocate larger chunks of shared memory without changing
kernel parameters,

Yes, getting out from under the SHMMAX bugaboo would be awfully nice.

and perhaps even to dynamically resize shared memory chunks.

This I don't really believe will ever work reliably, especially not in
32-bit machines. Whatever your kernel API is, you still have the
problem of finding address space contiguous to what you were already
using.

regards, tom lane

#11

Tom Lane

tgl@sss.pgh.pa.us

almost 15 years ago

In reply to: Robert Haas (#7)

Re: POSIX shared memory redux

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Apr 10, 2011 at 5:03 PM, A.M. <agentm@themactionfaction.com> wrote:

To ensure that no two postmasters can startup in the same data directory, I use fcntl range locking on the data directory lock file, which also works properly on (properly configured) NFS volumes. Whenever a postmaster or postmaster child starts, it acquires a read (non-exclusive) lock on the data directory's lock file. When a new postmaster starts, it queries if anything would block a write (exclusive) lock on the lock file which returns a lock-holding PID in the case when other postgresql processes are running.

This seems a lot leakier than what we do now (imagine, for example,
shared storage) and I'm not sure what the advantage is.

BTW, the above-described solution flat out doesn't work anyway, because
it has a race condition. Postmaster children have to reacquire the lock
after forking, because fcntl locks aren't inherited during fork(). And
that means you can't tell whether there's a just-started backend that
hasn't yet acquired the lock. It's really critical for our purposes
that SysV shmem segments are inherited at fork() and so there's no
window where a just-forked backend isn't visible to somebody checking
the state of the shmem segment.
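
The fork() behavior described above can be checked with a small sketch. It assumes a scratch file under /tmp; the helper names lock_whole_file and child_sees_parent_lock are hypothetical and do not come from the patch. The parent takes an exclusive fcntl lock, forks, and the child probes with F_GETLK. If the lock had been inherited, the child would see no conflict; instead it should find the parent's PID blocking it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: apply a whole-file lock request (l_len = 0) to fd. */
static int
lock_whole_file(int fd, short type, int cmd, struct flock *out)
{
	struct flock lock;
	int			rc;

	assert(fd >= 0);
	memset(&lock, 0, sizeof(lock));
	lock.l_type = type;
	lock.l_whence = SEEK_SET;
	rc = fcntl(fd, cmd, &lock);
	if (out != NULL)
		*out = lock;			/* F_GETLK writes its answer into the struct */
	return rc;
}

/*
 * Returns 1 if a forked child finds the parent's lock blocking it (the lock
 * was not inherited), 0 if the child sees no conflict, -1 on error.
 */
static int
child_sees_parent_lock(void)
{
	char		path[] = "/tmp/fcntl_fork_sketch_XXXXXX";	/* assumed location */
	pid_t		child;
	int			status;
	int			fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);				/* the open descriptor keeps the file alive */

	if (lock_whole_file(fd, F_WRLCK, F_SETLK, NULL) < 0)
	{
		close(fd);
		return -1;
	}

	child = fork();
	if (child < 0)
	{
		close(fd);
		return -1;
	}
	if (child == 0)
	{
		/* The child inherits the descriptor, but not the lock. */
		struct flock probe;

		if (lock_whole_file(fd, F_WRLCK, F_GETLK, &probe) < 0)
			_exit(2);
		/* F_GETLK reports F_UNLCK when nothing would block the request. */
		_exit((probe.l_type != F_UNLCK && probe.l_pid == getppid()) ? 0 : 1);
	}

	if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
	{
		close(fd);
		return -1;
	}
	close(fd);
	if (WEXITSTATUS(status) == 0)
		return 1;
	return (WEXITSTATUS(status) == 1) ? 0 : -1;
}
```

Between fork() and the child's own F_SETLK call, an outside observer probing the file would see only the parent's lock, which is the window being described.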

regards, tom lane

#12

A.M.

agentm@themactionfaction.com

almost 15 years ago

In reply to: Tom Lane (#11)

Re: POSIX shared memory redux

On Apr 11, 2011, at 7:25 PM, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Apr 10, 2011 at 5:03 PM, A.M. <agentm@themactionfaction.com> wrote:

To ensure that no two postmasters can startup in the same data directory, I use fcntl range locking on the data directory lock file, which also works properly on (properly configured) NFS volumes. Whenever a postmaster or postmaster child starts, it acquires a read (non-exclusive) lock on the data directory's lock file. When a new postmaster starts, it queries if anything would block a write (exclusive) lock on the lock file which returns a lock-holding PID in the case when other postgresql processes are running.

This seems a lot leakier than what we do now (imagine, for example,
shared storage) and I'm not sure what the advantage is.

BTW, the above-described solution flat out doesn't work anyway, because
it has a race condition. Postmaster children have to reacquire the lock
after forking, because fcntl locks aren't inherited during fork(). And
that means you can't tell whether there's a just-started backend that
hasn't yet acquired the lock. It's really critical for our purposes
that SysV shmem segments are inherited at fork() and so there's no
window where a just-forked backend isn't visible to somebody checking
the state of the shmem segment.

Then you haven't looked at my patch, because I address this race condition by ensuring that a lock-holding violator is the postmaster or a postmaster child. If such a condition is detected, the child exits immediately without touching the shared memory. POSIX shmem is inherited via file descriptors.

Cheers,
M

#13

A.M.

agentm@themactionfaction.com

almost 15 years ago

In reply to: Tom Lane (#10)

Re: POSIX shared memory redux

On Apr 11, 2011, at 7:13 PM, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Apr 11, 2011 at 3:11 PM, A.M. <agentm@themactionfaction.com> wrote:

What do you mean by "leakier"? The goal here is to extinguish SysV shared memory for portability and convenience benefits. The mini-SysV proposal was implemented and shot down by Tom Lane.

I mean I'm not convinced that fcntl() locking will be as reliable.

I'm not either. Particularly not on NFS. (Although on NFS you have
other issues to worry about too, like postmasters on different machines
being able to reach the same data directory. I wonder if we should do
both SysV and fcntl locking ...)

Is there an example of a recent system where fcntl is broken (ignoring NFS)? I believe my patch addresses all potential race conditions and uses the APIs properly to guarantee single-postmaster data directory usage and I tested on Darwin and a two-year-old Linux kernel. In the end, fcntl locking relies on the same kernel which provides the SysV user count, so I'm not sure what makes it less "reliable", but I have heard that twice now, so I am open to hearing about your experiences.

I know Tom shot that down before, but I still think it's probably the
best way forward.

Did I? I think I pointed out that there's zero gain in portability as
long as we still depend on SysV shmem to work. However, if you're doing
it for other reasons than portability, it might make sense anyway. The
question is whether there are adequate other reasons.

I provided an example of postmaster-failover relying on F_SETLKW in the email with the patch. Also, as you point out above, fcntl locking at least has a chance of working over NFS.
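
A minimal sketch of the F_SETLKW wait behind that failover idea, under stated assumptions: a forked child stands in for the running postmaster, the lock file is a scratch file under /tmp, and the names lock_whole_file and standby_takes_over_after_primary_exit are hypothetical rather than taken from the patch. The "standby" first confirms that a non-blocking request is refused, then blocks in F_SETLKW until the "primary" exits and its lock disappears.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: request a whole-file lock (l_len = 0) on fd. */
static int
lock_whole_file(int fd, short type, int cmd)
{
	struct flock lock;

	assert(fd >= 0);
	memset(&lock, 0, sizeof(lock));
	lock.l_type = type;
	lock.l_whence = SEEK_SET;
	return fcntl(fd, cmd, &lock);
}

/* Returns 1 if the standby was refused first and acquired the lock later. */
static int
standby_takes_over_after_primary_exit(void)
{
	char		path[] = "/tmp/fcntl_wait_sketch_XXXXXX";	/* assumed location */
	int			locked[2];		/* primary -> standby: "I hold the lock" */
	int			release[2];		/* standby -> primary: "you may exit now" */
	int			fd = mkstemp(path);
	pid_t		primary;
	char		byte = 0;
	int			refused;
	int			acquired;

	if (fd < 0)
		return -1;
	unlink(path);
	if (pipe(locked) < 0 || pipe(release) < 0)
		return -1;

	primary = fork();
	if (primary < 0)
		return -1;
	if (primary == 0)
	{
		/* Stand-in for the running postmaster: hold the lock until told to go. */
		close(locked[0]);
		close(release[1]);
		if (lock_whole_file(fd, F_WRLCK, F_SETLK) < 0)
			_exit(2);
		if (write(locked[1], "x", 1) != 1)
			_exit(2);
		if (read(release[0], &byte, 1) < 0)
			_exit(2);
		_exit(0);				/* process exit releases the fcntl lock */
	}

	close(locked[1]);
	close(release[0]);
	if (read(locked[0], &byte, 1) != 1)
	{
		close(release[1]);
		waitpid(primary, NULL, 0);
		return -1;
	}

	/* A non-blocking attempt must fail while the primary is alive. */
	refused = (lock_whole_file(fd, F_WRLCK, F_SETLK) < 0 &&
			   (errno == EACCES || errno == EAGAIN));

	/* Let the primary exit, then wait in the kernel for the lock. */
	if (write(release[1], "x", 1) != 1)
		return -1;
	acquired = (lock_whole_file(fd, F_WRLCK, F_SETLKW) == 0);

	waitpid(primary, NULL, 0);
	close(locked[0]);
	close(release[1]);
	close(fd);
	return (refused && acquired) ? 1 : 0;
}
```

The sketch only covers a single local filesystem; it says nothing about how reliably the same wait behaves over NFS.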

The advantage I see is that we would be able to
more easily allocate larger chunks of shared memory without changing
kernel parameters,

Yes, getting out from under the SHMMAX bugaboo would be awfully nice.

Yes, please! That is my primary motivation for this patch.

and perhaps even to dynamically resize shared memory chunks.

This I don't really believe will ever work reliably, especially not in
32-bit machines. Whatever your kernel API is, you still have the
problem of finding address space contiguous to what you were already
using.

Even if expanding shmem involves copying large regions of memory, it could at least be useful to adjust buffer sizes live without a restart.

Cheers,
M

#14

Tom Lane

tgl@sss.pgh.pa.us

almost 15 years ago

In reply to: A.M. (#13)

Re: POSIX shared memory redux

"A.M." <agentm@themactionfaction.com> writes:

On Apr 11, 2011, at 7:13 PM, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I mean I'm not convinced that fcntl() locking will be as reliable.

I'm not either. Particularly not on NFS.

Is there an example of a recent system where fcntl is broken (ignoring NFS)?

Well, the fundamental point is that "ignoring NFS" is not the real
world. We can't tell people not to put data directories on NFS,
and even if we did tell them not to, they'd still do it. And NFS
locking is not trustworthy, because the remote lock daemon can crash
and restart (forgetting everything it ever knew) while your own machine
and the postmaster remain blissfully awake.

None of this is to say that an fcntl lock might not be a useful addition
to what we do already. It is to say that fcntl can't just replace what
we do already, because there are real-world failure cases that the
current solution handles and fcntl alone wouldn't.

regards, tom lane

#15

A.M.

agentm@themactionfaction.com

almost 15 years ago

In reply to: Tom Lane (#14)

Re: POSIX shared memory redux

On Apr 13, 2011, at 2:06 AM, Tom Lane wrote:

"A.M." <agentm@themactionfaction.com> writes:

On Apr 11, 2011, at 7:13 PM, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I mean I'm not convinced that fcntl() locking will be as reliable.

I'm not either. Particularly not on NFS.

Is there an example of a recent system where fcntl is broken (ignoring NFS)?

Well, the fundamental point is that "ignoring NFS" is not the real
world. We can't tell people not to put data directories on NFS,
and even if we did tell them not to, they'd still do it. And NFS
locking is not trustworthy, because the remote lock daemon can crash
and restart (forgetting everything it ever knew) while your own machine
and the postmaster remain blissfully awake.

None of this is to say that an fcntl lock might not be a useful addition
to what we do already. It is to say that fcntl can't just replace what
we do already, because there are real-world failure cases that the
current solution handles and fcntl alone wouldn't.

The goal of this patch is to eliminate SysV shared memory, not to implement NFS-capable locking which, as you point out, is virtually impossible.

As far as I can tell, in the worst case, my patch does not change how postgresql handles the NFS case. SysV shared memory won't work across NFS, so that interlock won't catch, so postgresql is left with looking at a lock file with PID of process on another machine, so that won't catch either. This patch does not alter the lock file semantics, but merely augments the file with file locking.

At least with this patch, there is a chance the lock might work across NFS. In the best case, it can allow for shared-storage postgresql failover, which is a new feature.

Furthermore, there is an improvement in shared memory handling in that it is unlinked immediately after creation, so only the postmaster and its children have access to it (through file descriptor inheritance). This means shared memory cannot be stomped on by any other process.
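
For illustration, the create-then-unlink approach described above can be sketched roughly as follows. This is a minimal sketch, not code from the patch: the object name, the 4096-byte size, and the helper names `create_unlinked_shm` and `demo_unlinked_shm` are assumptions made for the example. On older glibc versions it may need to be linked with `-lrt`.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a POSIX shm object and unlink its name at once (hypothetical helper).
 * Returns a descriptor for the now-nameless region, or -1 on error. */
int create_unlinked_shm(const char *name, size_t size)
{
    /* O_EXCL makes creation fail instead of silently reusing an existing object. */
    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    /* From here on only the descriptor (and its inheritors) can reach the region. */
    shm_unlink(name);

    if (ftruncate(fd, (off_t) size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Returns 1 if the name is gone and a forked child shares the region, 0 if not,
 * -1 on setup error. */
int demo_unlinked_shm(void)
{
    const size_t size = 4096;
    char name[64];

    /* Hypothetical naming scheme; it only has to be unique for an instant. */
    snprintf(name, sizeof name, "/pg_shm_demo_%ld", (long) getpid());

    int fd = create_unlinked_shm(name, size);
    if (fd < 0)
        return -1;

    /* An unrelated process trying to open the name would now fail like this. */
    int probe = shm_open(name, O_RDWR, 0);
    int name_gone = (probe < 0 && errno == ENOENT);
    if (probe >= 0)
        close(probe);

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: map the region through the inherited descriptor and write to it. */
        int *p = (int *) mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            _exit(1);
        p[0] = 42;
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0) {
        close(fd);
        return -1;
    }

    /* Parent: the child's write is visible because both map the same object. */
    int *p = (int *) mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    int shared = WIFEXITED(status) && WEXITSTATUS(status) == 0 && p[0] == 42;

    munmap(p, size);
    close(fd);   /* last reference gone: the kernel frees the region */
    return name_gone && shared;
}
```

Calling `demo_unlinked_shm()` should return 1 on a system with working POSIX shared memory. The sketch shows only the access-control property. It says nothing about the startup interlock, which is a separate question.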

Considering that possibly working NFS locking is a side-effect of this patch and not its goal and, in the worst possible scenario, it doesn't change current behavior, I don't see how this can be a ding against this patch.

Cheers,
M

#16

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: A.M. (#15)

Re: POSIX shared memory redux

On Wed, Apr 13, 2011 at 7:20 AM, A.M. <agentm@themactionfaction.com> wrote:

The goal of this patch is to eliminate SysV shared memory, not to implement NFS-capable locking which, as you point out, is virtually impossible.

As far as I can tell, in the worst case, my patch does not change how postgresql handles the NFS case. SysV shared memory won't work across NFS, so that interlock won't catch, so postgresql is left with looking at a lock file with PID of process on another machine, so that won't catch either. This patch does not alter the lock file semantics, but merely augments the file with file locking.

At least with this patch, there is a chance the lock might work across NFS. In the best case, it can allow for shared-storage postgresql failover, which is a new feature.

Furthermore, there is an improvement in shared memory handling in that it is unlinked immediately after creation, so only the postmaster and its children have access to it (through file descriptor inheritance). This means shared memory cannot be stomped on by any other process.

Considering that possibly working NFS locking is a side-effect of this patch and not its goal and, in the worst possible scenario, it doesn't change current behavior, I don't see how this can be a ding against this patch.

I don't see why we need to get rid of SysV shared memory; needing less
of it seems just as good.

In answer to your off-list question, one of the principal ways I've
seen fcntl() locking fall over and die is when someone removes the
lock file. You might think that this could be avoided by picking
something important like pg_control as the lock file, but it turns out
that doesn't really work:

http://0pointer.de/blog/projects/locking.html

Tom's point is valid too. Many storage appliances present themselves
as an NFS server, so it's very plausible for the data directory to be
on an NFS server, and there's no guarantee that flock() won't be
broken there. If our current interlock were known to be unreliable
also maybe we wouldn't care very much, but AFAICT it's been extremely
robust.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17

Tom Lane

tgl@sss.pgh.pa.us

over 14 years ago

In reply to: Robert Haas (#16)

Re: POSIX shared memory redux

Robert Haas <robertmhaas@gmail.com> writes:

In answer to your off-list question, one of the principal ways I've
seen fcntl() locking fall over and die is when someone removes the
lock file. You might think that this could be avoided by picking
something important like pg_control as the lock file, but it turns out
that doesn't really work:

http://0pointer.de/blog/projects/locking.html

Hm, I wasn't aware of the fact that you lose an fcntl() lock if you
close a *different* file descriptor for the same file. My goodness,
that's horrid :-(. It basically means that any third-party code running
in a backend could accidentally defeat the locking, and most of the time
you'd never even know it had happened ... as opposed to what would
happen if you detached from shmem for instance. You could run with such
code for years, and probably never know why sometimes the interlock
failed to work when you needed it to.
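
A minimal sketch of that failure mode, for illustration only. The temporary file path and the helper names (`lock_whole_file`, `locked_by_other_process`, `demo_lock_lost_on_unrelated_close`) are hypothetical and not taken from the patch. The check runs in a forked child because F_GETLK never reports locks held by the calling process itself.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Take an exclusive whole-file fcntl() lock on fd; returns 0 on success. */
int lock_whole_file(int fd)
{
    struct flock fl = {0};
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                /* 0 means "through end of file" */
    return fcntl(fd, F_SETLK, &fl);
}

/* Ask from a forked child whether another process holds a lock on path.
 * Returns 1 if locked, 0 if not, -1 on error. */
int locked_by_other_process(const char *path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        int fd = open(path, O_RDWR);
        struct flock fl = {0};
        fl.l_type = F_WRLCK;     /* "would a write lock conflict with anything?" */
        fl.l_whence = SEEK_SET;
        if (fd < 0 || fcntl(fd, F_GETLK, &fl) < 0)
            _exit(2);
        _exit(fl.l_type == F_UNLCK ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    int code = WEXITSTATUS(status);
    return code == 2 ? -1 : code;
}

/* Returns 1 if the lock was held and then vanished after closing an
 * unrelated descriptor for the same file, 0 otherwise, -1 on setup error. */
int demo_lock_lost_on_unrelated_close(void)
{
    char path[] = "/tmp/fcntl_demo_XXXXXX";
    int fd1 = mkstemp(path);
    if (fd1 < 0)
        return -1;
    if (lock_whole_file(fd1) < 0) {
        close(fd1);
        unlink(path);
        return -1;
    }

    int before = locked_by_other_process(path);

    /* Stand-in for third-party code that opens and closes the same file. */
    int fd2 = open(path, O_RDONLY);
    close(fd2);                  /* releases every fcntl lock this process holds on the file */

    int after = locked_by_other_process(path);

    close(fd1);
    unlink(path);
    return (before == 1 && after == 0) ? 1 : 0;
}
```

On a local filesystem with POSIX record-locking semantics, `demo_lock_lost_on_unrelated_close()` should return 1: the lock taken through `fd1` is gone even though `fd1` itself was never closed.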

regards, tom lane

#18

A.M.

agentm@themactionfaction.com

over 14 years ago

In reply to: Robert Haas (#16)

Re: POSIX shared memory redux

On Apr 13, 2011, at 8:36 PM, Robert Haas wrote:

I don't see why we need to get rid of SysV shared memory; needing less
of it seems just as good.

1. As long as one keeps SysV shared memory around, the postgresql project has to maintain the annoying platform-specific document on how to configure the poorly named kernel parameters. If the SysV region is very small, that means I can run more postgresql instances within the same kernel limits, but one can still hit the limits. My patch allows the postgresql project to delete that page and the hassles with it.

2. My patch proves that SysV is wholly unnecessary. Are you attached to it? (Pun intended.)

In answer to your off-list question, one of the principal ways I've
seen fcntl() locking fall over and die is when someone removes the
lock file. You might think that this could be avoided by picking
something important like pg_control as the lock file, but it turns out
that doesn't really work:

http://0pointer.de/blog/projects/locking.html

Tom's point is valid too. Many storage appliances present themselves
as an NFS server, so it's very plausible for the data directory to be
on an NFS server, and there's no guarantee that flock() won't be
broken there. If our current interlock were known to be unreliable
also maybe we wouldn't care very much, but AFAICT it's been extremely
robust.

Both you and Tom have somehow assumed that the patch alters current postgresql behavior. In fact, the opposite is true. I haven't changed any of the existing behavior. The "robust" behavior remains. I merely added fcntl interlocking on top of the lock file to replace the SysV shmem check. If someone deletes the postgresql lock file over NFS, the data directory is equally screwed, but with my patch there is a chance that two machines sharing a properly-configured NFS mount can properly interlock- postgresql cannot offer that today, so this is a feature upgrade with no loss. The worst case scenario is today's behavior.

My original goal remains to implement POSIX shared memory, but Tom Lane was right to point out that the current interlocking check relies on SysV, so, even though the startup locking is really orthogonal to shared memory, I implemented what could be considered a separate patch for that and rolled it into one.

I would encourage you to take a look at the patch.

Cheers,
M

#19

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: A.M. (#18)

Re: POSIX shared memory redux

On Wed, Apr 13, 2011 at 6:11 PM, A.M. <agentm@themactionfaction.com> wrote:

I don't see why we need to get rid of SysV shared memory; needing less
of it seems just as good.

1. As long as one keeps SysV shared memory around, the postgresql project has to maintain the annoying platform-specific document on how to configure the poorly named kernel parameters. If the SysV region is very small, that means I can run more postgresql instances within the same kernel limits, but one can still hit the limits. My patch allows the postgresql project to delete that page and the hassles with it.

2. My patch proves that SysV is wholly unnecessary. Are you attached to it? (Pun intended.)

With all due respect, I think this is an unproductive conversation.
Your patch proves that SysV is wholly unnecessary only if we also
agree that fcntl() locking is just as reliable as the nattch
interlock, and Tom and I are trying to explain why we don't believe
that's the case. Saying that we're just wrong without responding to
our points substantively doesn't move the conversation forward.

In case it's not clear, here again is what we're concerned about: A
System V shm *cannot* be removed until nobody is attached to it. A
lock file can be removed, or the lock can be accidentally released by
the apparently innocuous operation of closing a file descriptor.
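
To make the nattch idea concrete, here is a minimal sketch of how the attach count of a SysV segment can be read with `shmctl(IPC_STAT)`. It is an illustration under assumptions, not PostgreSQL's actual startup code: the segment is created with `IPC_PRIVATE`, and the helper names `attach_count` and `demo_nattch` are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Number of current attachments to segment shmid, or -1 on error. */
long attach_count(int shmid)
{
    struct shmid_ds ds;
    if (shmctl(shmid, IPC_STAT, &ds) < 0)
        return -1;
    return (long) ds.shm_nattch;
}

/* Returns 1 if the kernel's count tracks a forked child's attachment
 * (1 alone, 2 while the child lives, 1 after it exits), 0 if not,
 * -1 on setup error. */
int demo_nattch(void)
{
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid < 0)
        return -1;

    void *addr = shmat(shmid, NULL, 0);
    int hold[2];
    if (addr == (void *) -1 || pipe(hold) < 0) {
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }

    long alone = attach_count(shmid);

    pid_t pid = fork();
    if (pid < 0) {
        shmdt(addr);
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }
    if (pid == 0) {
        /* Child inherits the attachment; stay alive until the parent closes the pipe. */
        char c;
        close(hold[1]);
        if (read(hold[0], &c, 1) < 0)
            _exit(1);
        _exit(0);
    }
    close(hold[0]);

    long with_child = attach_count(shmid);   /* child still attached */

    close(hold[1]);                           /* let the child exit */
    waitpid(pid, NULL, 0);

    long after = attach_count(shmid);         /* kernel detached it on exit */

    shmdt(addr);
    shmctl(shmid, IPC_RMID, NULL);
    return alone == 1 && with_child == 2 && after == 1;
}
```

The count is maintained by the kernel, so the child never has to cooperate: it is counted from the moment it inherits the attachment until it exits, however it exits.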

Both you and Tom have somehow assumed that the patch alters current postgresql behavior. In fact, the opposite is true. I haven't changed any of the existing behavior. The "robust" behavior remains. I merely added fcntl interlocking on top of the lock file to replace the SysV shmem check.

This seems contradictory. If you replaced the SysV shmem check, then
it's not there, which means you altered the behavior.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20

Tom Lane

tgl@sss.pgh.pa.us

over 14 years ago

In reply to: A.M. (#18)

Re: POSIX shared memory redux

"A.M." <agentm@themactionfaction.com> writes:

1. As long as one keeps SysV shared memory around, the postgresql project
has to maintain the annoying platform-specific document on how to
configure the poorly named kernel parameters.

No, if it's just a small area, I don't see that that's an issue.
You're going to max out on other things (like I/O bandwidth) long before
you run into the limit on how many postmasters you can have from this.
The reason that those parameters are problematic now is that people tend
to want *large* shmem segments and the typical defaults aren't friendly
to that.

2. My patch proves that SysV is wholly unnecessary. Are you attached to it? (Pun intended.)

You were losing this argument already, but ad hominem attacks are pretty
much guaranteed to get people to tune you out. There are real,
substantive, unfixable reasons to not trust fcntl locking completely.

I would encourage you to take a look at the patch.

Just to be perfectly clear: I have not read your patch, and am not
likely to before the next commitfest starts, because I have
approximately forty times too many things to do already. I'm just going
off your own abbreviated description of the patch. But from what I know
about fcntl locking, it's not a sufficient substitute for the SysV shmem
interlock, because it has failure modes that the SysV interlock doesn't,
and those failure modes occur in real-world cases. Yeah, it'd be nice
to also be able to lock against other postmasters on other NFS clients,
but we hardly ever hear of somebody getting burnt by the lack of that
(and fcntl wouldn't be a bulletproof defense anyway). On the other
hand, accidentally trying to start a duplicate postmaster on the same
machine is an everyday occurrence.

regards, tom lane

#21

Florian Weimer

fweimer@bfk.de

over 14 years ago

In reply to: Tom Lane (#14)

Re: POSIX shared memory redux

* Tom Lane:

Well, the fundamental point is that "ignoring NFS" is not the real
world. We can't tell people not to put data directories on NFS,
and even if we did tell them not to, they'd still do it. And NFS
locking is not trustworthy, because the remote lock daemon can crash
and restart (forgetting everything it ever knew) while your own machine
and the postmaster remain blissfully awake.

Is this still the case with NFSv4? Does the local daemon still keep
the lock state?

None of this is to say that an fcntl lock might not be a useful addition
to what we do already. It is to say that fcntl can't just replace what
we do already, because there are real-world failure cases that the
current solution handles and fcntl alone wouldn't.

If it requires NFS misbehavior (possibly in an older version), and you
have to start postmasters on separate nodes (which you normally
wouldn't do), doesn't this make it increasingly unlikely that it's
going to be triggered in the wild?

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

#22

A.M.

agentm@themactionfaction.com

over 14 years ago

In reply to: Robert Haas (#19)

Re: POSIX shared memory redux

On Apr 13, 2011, at 9:30 PM, Robert Haas wrote:

On Wed, Apr 13, 2011 at 6:11 PM, A.M. <agentm@themactionfaction.com> wrote:

I don't see why we need to get rid of SysV shared memory; needing less
of it seems just as good.

1. As long as one keeps SysV shared memory around, the postgresql project has to maintain the annoying platform-specific document on how to configure the poorly named kernel parameters. If the SysV region is very small, that means I can run more postgresql instances within the same kernel limits, but one can still hit the limits. My patch allows the postgresql project to delete that page and the hassles with it.

2. My patch proves that SysV is wholly unnecessary. Are you attached to it? (Pun intended.)

With all due respect, I think this is an unproductive conversation.
Your patch proves that SysV is wholly unnecessary only if we also
agree that fcntl() locking is just as reliable as the nattch
interlock, and Tom and I are trying to explain why we don't believe
that's the case. Saying that we're just wrong without responding to
our points substantively doesn't move the conversation forward.

Sorry- it wasn't meant to be an attack- just a dumb pun. I am trying to argue that, even if the fcntl is unreliable, the startup procedure is just as reliable as it is now. The reasons being:

1) the SysV nattch method's primary purpose is to protect the shmem region. This is no longer necessary in my patch because the shared memory is unlinked immediately after creation, so only the initial postmaster and its children have access.

2) the standard postgresql lock file remains the same

Furthermore, there is a case that the SysV nattch check cannot catch but fcntl locking can: if two separate machines have a postgresql data directory mounted over NFS, postgresql will currently allow both machines to start a postmaster in that directory. The SysV nattch check finds nothing on the second machine, and the pid in the lock file is a pid on the first machine, so postgresql will say "starting anyway". With fcntl locking, this can be fixed. SysV only has presence on one kernel.

In case it's not clear, here again is what we're concerned about: A
System V shm *cannot* be removed until nobody is attached to it. A
lock file can be removed, or the lock can be accidentally released by
the apparently innocuous operation of closing a file descriptor.

Both you and Tom have somehow assumed that the patch alters current postgresql behavior. In fact, the opposite is true. I haven't changed any of the existing behavior. The "robust" behavior remains. I merely added fcntl interlocking on top of the lock file to replace the SysV shmem check.

This seems contradictory. If you replaced the SysV shmem check, then
it's not there, which means you altered the behavior.

From what I understood, the primary purpose of the SysV check was to protect the shared memory from multiple stompers. The interlock was a neat side-effect.

The lock file contents are currently important to get the pid of a potential, conflicting postmaster. With the fcntl API, we can return a live conflicting PID (whether a postmaster or a stuck child), so that's an improvement. This could be used, for example, for STONITH, to reliably kill a dying replication clone- just loop on the pids returned from the lock.
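
As an illustration of the point about conflicting PIDs, the sketch below uses F_GETLK to ask which process holds a conflicting lock. It is a sketch under assumptions rather than code from the patch: the temporary file path and the helper names `lock_holder` and `demo_lock_holder` are hypothetical, and a forked child stands in for the conflicting postmaster.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* PID of a process holding a conflicting lock on fd, 0 if none, -1 on error. */
pid_t lock_holder(int fd)
{
    struct flock fl = {0};
    fl.l_type = F_WRLCK;         /* probe as if we wanted an exclusive lock */
    fl.l_whence = SEEK_SET;
    if (fcntl(fd, F_GETLK, &fl) < 0)
        return -1;
    return fl.l_type == F_UNLCK ? 0 : fl.l_pid;
}

/* Returns 1 if F_GETLK names the child while it holds the lock and reports
 * no holder after the child exits, 0 if not, -1 on setup error. */
int demo_lock_holder(void)
{
    char path[] = "/tmp/lockpid_demo_XXXXXX";
    int fd = mkstemp(path);
    int ready[2], done[2];
    if (fd < 0 || pipe(ready) < 0 || pipe(done) < 0)
        return -1;

    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        /* Child: lock the file through its own descriptor, report, then wait. */
        char c;
        struct flock fl = {0};
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        close(ready[0]);
        close(done[1]);
        int cfd = open(path, O_RDWR);
        if (cfd < 0 || fcntl(cfd, F_SETLK, &fl) < 0)
            _exit(1);
        if (write(ready[1], "x", 1) != 1)
            _exit(1);
        if (read(done[0], &c, 1) < 0)   /* returns at EOF when the parent is finished */
            _exit(1);
        _exit(0);
    }
    close(ready[1]);
    close(done[0]);

    char c;
    int got = (read(ready[0], &c, 1) == 1);    /* wait until the child has the lock */
    pid_t holder = got ? lock_holder(fd) : -1;

    close(done[1]);                             /* release the child */
    waitpid(child, NULL, 0);
    pid_t after = lock_holder(fd);              /* lock vanished with the process */

    close(ready[0]);
    close(fd);
    unlink(path);
    return holder == child && after == 0;
}
```

F_GETLK reports only one conflicting holder per call, and the lock disappears as soon as its owner exits, so any loop built on this would have to re-query after each step.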

Even if the fcntl check passes, the pid in the lock file is checked, so the lock file behavior remains the same.

If you were to implement a daemon with a shared data directory but no shared memory, how would you implement the interlock? Would you still insist on SysV shmem? Unix daemons generally rely on lock files alone. Perhaps there is a different API on which we can agree.

Cheers,
M

#23

A.M.

agentm@themactionfaction.com

over 14 years ago

In reply to: Florian Weimer (#21)

Re: POSIX shared memory redux

On Apr 14, 2011, at 8:22 AM, Florian Weimer wrote:

* Tom Lane:

Well, the fundamental point is that "ignoring NFS" is not the real
world. We can't tell people not to put data directories on NFS,
and even if we did tell them not to, they'd still do it. And NFS
locking is not trustworthy, because the remote lock daemon can crash
and restart (forgetting everything it ever knew) while your own machine
and the postmaster remain blissfully awake.

Is this still the case with NFSv4? Does the local daemon still keep
the lock state?

The lock handling has been fixed in NFSv4.

http://nfs.sourceforge.net/
"NFS Version 4 introduces support for byte-range locking and share reservation. Locking in NFS Version 4 is lease-based, so an NFS Version 4 client must maintain contact with an NFS Version 4 server to continue extending its open and lock leases."

http://linux.die.net/man/2/flock
"flock(2) does not lock files over NFS. Use fcntl(2) instead: that does work over NFS, given a sufficiently recent version of Linux and a server which supports locking."

I would need some more time to dig up what "recent version of Linux" specifies, but NFSv4 is likely required.

None of this is to say that an fcntl lock might not be a useful addition
to what we do already. It is to say that fcntl can't just replace what
we do already, because there are real-world failure cases that the
current solution handles and fcntl alone wouldn't.

If it requires NFS misbehavior (possibly in an older version), and you
have to start postmasters on separate nodes (which you normally
wouldn't do), doesn't this make it increasingly unlikely that it's
going to be triggered in the wild?

With the patch I offer, it would be possible to use shared storage and failover postgresql nodes on different machines over NFS. (The second postmaster blocks and waits for the lock to be released.) Obviously, such a setup isn't as strong as using replication, but given a sufficiently fail-safe shared storage setup, it could be made reliable.

Cheers,
M

#24

A.M.

agentm@themactionfaction.com

over 14 years ago

In reply to: Tom Lane (#20)

Re: POSIX shared memory redux

On Apr 13, 2011, at 11:37 PM, Tom Lane wrote:

"A.M." <agentm@themactionfaction.com> writes:

1. As long as one keeps SysV shared memory around, the postgresql project
has to maintain the annoying platform-specific document on how to
configure the poorly named kernel parameters.

No, if it's just a small area, I don't see that that's an issue.
You're going to max out on other things (like I/O bandwidth) long before
you run into the limit on how many postmasters you can have from this.
The reason that those parameters are problematic now is that people tend
to want *large* shmem segments and the typical defaults aren't friendly
to that.

That's assuming that no other processes on the system are using up the available segments (such as older postgresql instances).

2. My patch proves that SysV is wholly unnecessary. Are you attached to it? (Pun intended.)

You were losing this argument already, but ad hominem attacks are pretty
much guaranteed to get people to tune you out.

I apologized to Robert Haas in another post- no offense was intended.

There are real,
substantive, unfixable reasons to not trust fcntl locking completely.

...on NFS which the postgresql community doesn't recommend anyway. But even in that case, the existing lock file (even without the fcntl lock), can catch that case via the PID in the file contents. That is what I meant when I claimed that the behavior remains the same.

I would encourage you to take a look at the patch.

Just to be perfectly clear: I have not read your patch, and am not
likely to before the next commitfest starts, because I have
approximately forty times too many things to do already. I'm just going
off your own abbreviated description of the patch. But from what I know
about fcntl locking, it's not a sufficient substitute for the SysV shmem
interlock, because it has failure modes that the SysV interlock doesn't,
and those failure modes occur in real-world cases. Yeah, it'd be nice
to also be able to lock against other postmasters on other NFS clients,
but we hardly ever hear of somebody getting burnt by the lack of that
(and fcntl wouldn't be a bulletproof defense anyway). On the other
hand, accidentally trying to start a duplicate postmaster on the same
machine is an everyday occurrence.

I really do appreciate the time you have put into feedback. I posed this question also to Robert Haas: is there a different API which you would find acceptable? I chose fcntl because it seemed well-suited for this task, but the feedback has been regarding NFS v<4 concerns.

Cheers,
M

#25

Martijn van Oosterhout

kleptog@svana.org

over 14 years ago

In reply to: A.M. (#22)

Re: POSIX shared memory redux

On Thu, Apr 14, 2011 at 10:26:33AM -0400, A.M. wrote:

1) the SysV nattch method's primary purpose is to protect the shmem
region. This is no longer necessary in my patch because the shared
memory is unlinked immediately after creation, so only the initial
postmaster and its children have access.

Umm, you don't unlink SysV shared memory. All the flag does is make
sure it goes away when the last user goes away. In the meantime people
can still connect to it.

The lock file contents are currently important to get the pid of a
potential, conflicting postmaster. With the fcntl API, we can return
a live conflicting PID (whether a postmaster or a stuck child), so
that's an improvement. This could be used, for example, for STONITH,
to reliably kill a dying replication clone- just loop on the pids
returned from the lock.

SysV shared memory also gives you a PID, that's the point.

Even if the fcntl check passes, the pid in the lock file is checked, so the lock file behavior remains the same.

The interlock is to make sure there are no living postmaster children.
The lockfile won't tell you that. So the issue is that while fcntl can
work, sysv can do better.

Also, I think you underestimate the value of the current interlock.
Before this, people did manage to trash their databases regularly this
way. Lockfiles can be deleted and yes, people do it all the time.

Actually, it occurs to me you can solve the NFS problem by putting the
lockfile in the socket dir. That can't possibly be on NFS.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Patriotism is when love of your own people comes first; nationalism,
when hate for people other than your own comes first.
- Charles de Gaulle

#26

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: A.M. (#22)

Re: POSIX shared memory redux

On Thu, Apr 14, 2011 at 7:26 AM, A.M. <agentm@themactionfaction.com> wrote:

From what I understood, the primary purpose of the SysV check was to protect the shared memory from multiple stompers. The interlock was a neat side-effect.

Not really - the purpose of the interlock is to protect the underlying
data files. The nattch interlock allows us to be very confident that
there isn't another postmaster running on the same data directory on
the same machine, and that is extremely important.

You've just about convinced me that it might not be a bad idea to add
the fcntl() interlock in addition because, as you say, that has a
chance of working even over NFS. But the interlock we have now is
*extremely* reliable, and I think we'd need to get some other
amazingly compelling benefit to consider changing it (even if we were
convinced that the alternate method was also reliable). I don't see
that there is one. Anyone running an existing version of PostgreSQL
in an environment where they care *at all* about performance has
already adjusted their SysV shm settings way up. The benefit of using
POSIX shm is that in, say, PostgreSQL 9.2, it might be possible for
shared buffers to have a somewhat higher default setting out of the
box, and be further increased from there without kernel parameter
changes. And there might be more benefits besides that, but certainly
those by themselves seem pretty worthwhile. SysV shm is extremely
portable, so I don't think that we're losing anything by continuing to
allocate a small amount of it (a few kilobytes, perhaps) and just push
everything else out into POSIX shm.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company