dynamic shared memory
Please find attached a first version of a patch to allow additional
"dynamic" shared memory segments; that is, shared memory segments that
are created after server startup, live for a period of time, and are
then destroyed when no longer needed. The main purpose of this patch
is to facilitate parallel query: if we've got multiple backends
working on the same query, they're going to need a way to communicate.
Doing that through the main shared memory segment seems infeasible
because we could, for some applications, need to share very large
amounts of data. For example, for internal sort, we basically load
the data to be sorted into memory and then rearrange an array of
pointers to the items being sorted. For parallel internal sort, we
might want to do much the same thing, but with different backend
processes manipulating different parts of the array. I'm not exactly
sure how that's going to work out yet in detail, but it seems fair to
say that the amount of data we want to share between processes there
could be quite a bit larger than anything we'd feel comfortable
nailing down in the permanent shared memory segment. Other cases,
like parallel sequential scan, might require much smaller buffers,
since there might not be much point in letting the scan get too far
ahead if nothing's consuming the tuples it produces. With this
infrastructure, we can choose at run-time exactly how much memory to
allocate for a particular purpose and return it to the operating
system as soon as we're done with it.
Creating a shared memory segment is a somewhat operating-system
dependent task. I decided that it would be smart to support several
different implementations and to let the user choose which one they'd
like to use via a new GUC, dynamic_shared_memory_type. Since we
currently require System V shared memory to be supported on all
platforms other than Windows, I have included a System V
implementation (shmget, shmctl, shmat, shmdt). However, as we know,
System V shared memory limits are often low out of the box, and
raising them is an annoyance for users. Therefore, I've included an
implementation based on the POSIX shared memory facilities (shm_open,
shm_unlink), which is the default on systems where those facilities
are supported (some of the BSDs do not support them, I believe). We
will also need a Windows implementation, which I have not attempted,
but one of my colleagues at EnterpriseDB will be filling in that gap.
In addition, I've included an implementation based on mmap of a plain
file. As compared with a true shared memory implementation, this
obviously has the disadvantage that the OS may be more likely to
decide to write back dirty pages to disk, which could hurt
performance. However, I believe it's worthy of inclusion all the
same, because there are a variety of situations in which it might be
more convenient than one of the other implementations. One is
debugging. On Mac OS X, for example, there seems to be no way to list
POSIX shared memory segments, and no easy way to inspect the contents
of either POSIX or System V shared memory segments. Another use case
is working around an administrator-imposed or OS-imposed shared memory
limit. If you're not allowed to allocate shared memory, but you are
allowed to create files, then this implementation will let you use
whatever facilities we build on top of dynamic shared memory anyway.
A third possible reason to use this implementation is
compartmentalization. For example, you can put the directory that
stores the dynamic shared memory segments on a RAM disk - which
removes the performance concern - and then do whatever you like with
that directory: secure it, put filesystem quotas on it, or sprinkle
magic pixie dust on it. It doesn't even seem out of the question that
there might be cases where there are multiple RAM disks present with
different performance characteristics (e.g. on NUMA machines) and this
would provide fine-grained control over where your shared memory
segments get placed. To make a long story short, I won't be crushed
if the consensus is against including this, but I think it's useful.
Other implementations are imaginable but not implemented here. For
example, you can imagine using the mmap() of an anonymous file.
However, since the point is that these segments are created on the fly
by individual backends and then shared with other backends, that gets
a little tricky. In order for the second backend to map the same
anonymous shared memory segment that the first one mapped, you'd have
to pass the file descriptor from one process to the other. There are
ways, on most if not all platforms, to pass file descriptors through
sockets, but there's not automatically a socket connection between the
two processes either, so it gets hairy to think about making this
work. I did, however, include a "none" implementation which has the
effect of shutting the facility off altogether.
The actual implementation is split up into two layers. dsm_impl.c/h
encapsulate the implementation-dependent functionality at a very raw
level, while dsm.c/h wrap that functionality in a more palatable API.
Most of that wrapper layer is concerned with just one problem:
avoiding leaks. This turned out to require multiple levels of
safeguards, which I duly implemented. First, dynamic shared memory
segments need to be reference-counted, so that when the last mapping
is removed, the segment automatically goes away (we could allow for
server-lifespan segments as well with only trivial changes, but I'm
not sure whether there are compelling use cases for that). If a
backend is terminated uncleanly, the postmaster needs to remove all
leftover segments during the crash-and-restart process, just as it
needs to reinitialize the main shared memory segment. And if all
processes are terminated uncleanly, the next postmaster startup needs
to clean up any segments that still exist, again just as we already do
for the main shared memory segment. Neither POSIX shared memory nor
System V shared memory provide an API for enumerating all existing
shared memory segments, so we must keep track ourselves of what we
have outstanding. Second, we need to ensure, within the scope of an
individual process, that we only retain a mapping for as long as
necessary. Just as memory contexts, locks, buffer pins, and other
resources automatically go away at the end of a query or
(sub)transaction, dynamic shared memory mappings created for a purpose
such as parallel sort need to go away if we abort mid-way through. Of
course, if you have a user backend coordinating with workers, it seems
pretty likely that the workers are just going to exit if they hit an
error, so having the mapping be process-lifetime wouldn't necessarily
be a big deal; but the user backend may stick around for a long time
and execute other queries, and we can't afford to have it accumulate
mappings, not least because that's equivalent to a session-lifespan
memory leak.
To help solve these problems, I invented something called the "dynamic
shared memory control segment". This is a dynamic shared memory
segment created at startup (or reinitialization) time by the
postmaster before any user processes are created. It is used to store a
list of the identities of all the other dynamic shared memory segments
we have outstanding and the reference count of each. If the
postmaster goes through a crash-and-reset cycle, it scans the control
segment and removes all the other segments mentioned there, and then
recreates the control segment itself. If the postmaster is killed off
(e.g. kill -9) and restarted, it locates the old control segment and
proceeds similarly. If the whole operating system is rebooted, the
old control segment won't exist any more, but that's OK, because none
of the other segments will either - except under the
mmap-a-regular-file implementation, which handles cleanup by scanning
the relevant directory rather than relying on the control segment.
These precautions seem sufficient to ensure that dynamic shared memory
segments can't survive the postmaster itself short of a hard kill, and
that even after a hard kill we'll clean things up on a subsequent
postmaster startup. The other problem, of making sure that segments
get unmapped at the proper time, is solved using the resource owner
mechanism. There is an API to create a mapping which is
session-lifespan rather than resource-owner lifespan, but the default
is resource-owner lifespan, which I suspect will be right for common
uses. Thus, there are four separate occasions on which we remove
shared memory segments: (1) resource owner cleanup, (2) backend exit
(for any session-lifespan mappings and anything else that slips
through the cracks), (3) postmaster exit (in case a child dies without
cleaning itself up), and (4) postmaster startup (in case the
postmaster dies without cleaning up).
There are quite a few problems that this patch does not solve. First,
while it does give you a shared memory segment, it doesn't provide you
with any help at all in figuring out what to put in that segment. The
task of figuring out how to communicate usefully through shared memory
is thus, for the moment, left entirely to the application programmer.
While there may be cases where that's just right, I suspect there will
be a wider range of cases where it isn't, and I plan to work on some
additional facilities, sitting on top of this basic structure, next,
though probably as a separate patch. Second, it doesn't make any
policy decisions about what is sensible either in terms of number of
shared memory segments or the sizes of those segments, even though
there are serious practical limits in both cases. Actually, the total
number of segments system-wide is limited by the size of the control
segment, which is sized based on MaxBackends. But there's nothing to
keep a single backend from eating up all the slots, even though that's
both pretty unfriendly and unportable, and there's no real limit to
the amount of memory it can gobble up per slot, either. In other
words, it would be a bad idea to write a contrib module that exposes a
relatively uncooked version of this layer to the user.
But, just for testing purposes, I did just that. The attached patch
includes contrib/dsm_demo, which lets you say
dsm_demo_create('something') in one session, and if you pass the return
value to dsm_demo_read() in the same or another session during the
lifetime of the first session, you'll read back the same value you
saved. This is not, by any stretch of the imagination, a
demonstration of the right way to use this facility - but as a crude
unit test, it suffices. Although I'm including it in the patch file,
I would anticipate removing it before commit. Hopefully, with a
little more functionality on top of what's included here, we'll soon
be in a position to build something that might actually be useful to
someone, but this layer itself is a bit too impoverished to build
something really cool, at least not without more work than I wanted to
put in as part of the development of this patch.
Using that crappy contrib module, I verified that the POSIX, System V,
and mmap implementations all work on my MacBook Pro (OS X 10.8.4) and
on Linux (Fedora 16). I wouldn't like to have to wager on having
gotten all of the details right to be absolutely portable everywhere,
so I wouldn't be surprised to see this break on other systems.
Hopefully that will be a matter of adjusting the configure tests a bit
rather than coping with substantive changes in available
functionality, but we'll see.
Finally, I'd like to thank Noah Misch for a lot of discussion and
thought that enabled me to make this patch much better than it
otherwise would have been. Although I didn't adopt Noah's preferred
solutions to all of the problems, and although there are probably
still some problems buried here, there would have been more if not for
his advice. I'd also like to thank the entire database server team at
EnterpriseDB for allowing me to dump large piles of work on them so
that I could work on this, and my boss, Tom Kincaid, for not allowing
other people to dump large piles of work on me.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment: dynshmem-v1.patch (application/octet-stream)
diff --git a/configure b/configure
index d4a544d..e9f7e2c 100755
--- a/configure
+++ b/configure
@@ -8384,6 +8384,180 @@ if test "$ac_res" != no; then
fi
+{ $as_echo "$as_me:$LINENO: checking for library containing shm_open" >&5
+$as_echo_n "checking for library containing shm_open... " >&6; }
+if test "${ac_cv_search_shm_open+set}" = set; then
+ $as_echo_n "(cached) " >&6
+else
+ ac_func_search_save_LIBS=$LIBS
+cat >conftest.$ac_ext <<_ACEOF
+/* confdefs.h. */
+_ACEOF
+cat confdefs.h >>conftest.$ac_ext
+cat >>conftest.$ac_ext <<_ACEOF
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char shm_open ();
+int
+main ()
+{
+return shm_open ();
+ ;
+ return 0;
+}
+_ACEOF
+for ac_lib in '' rt; do
+ if test -z "$ac_lib"; then
+ ac_res="none required"
+ else
+ ac_res=-l$ac_lib
+ LIBS="-l$ac_lib $ac_func_search_save_LIBS"
+ fi
+ rm -f conftest.$ac_objext conftest$ac_exeext
+if { (ac_try="$ac_link"
+case "(($ac_try" in
+ *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
+ *) ac_try_echo=$ac_try;;
+esac
+eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\""
+$as_echo "$ac_try_echo") >&5
+ (eval "$ac_link") 2>conftest.er1
+ ac_status=$?
+ grep -v '^ *+' conftest.er1 >conftest.err
+ rm -f conftest.er1
+ cat conftest.err >&5
+ $as_echo "$as_me:$LINENO: \$? = $ac_status" >&5
+ (exit $ac_status); } && {
+ test -z "$ac_c_werror_flag" ||
+ test ! -s conftest.err
+ } && test -s conftest$ac_exeext && {
+ test "$cross_compiling" = yes ||
+ $as_test_x conftest$ac_exeext
+ }; then
+ ac_cv_search_shm_open=$ac_res
+else
+ $as_echo "$as_me: failed program was:" >&5
+sed 's/^/| /' conftest.$ac_ext >&5
+
+
+fi
+
+rm -rf conftest.dSYM
+rm -f core conftest.err conftest.$ac_objext conftest_ipa8_conftest.oo \
+ conftest$ac_exeext
+ if test "${ac_cv_search_shm_open+set}" = set; then
+ break
+fi
+done
+if test "${ac_cv_search_shm_open+set}" = set; then
+ :
+else
+ ac_cv_search_shm_open=no
+fi
+rm conftest.$ac_ext
+LIBS=$ac_func_search_save_LIBS
+fi
+{ $as_echo "$as_me:$LINENO: result: $ac_cv_search_shm_open" >&5
+$as_echo "$ac_cv_search_shm_open" >&6; }
+ac_res=$ac_cv_search_shm_open
+if test "$ac_res" != no; then
+ test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
+
+fi
+
+{ $as_echo "$as_me:$LINENO: checking for library containing shm_unlink" >&5
+$as_echo_n "checking for library containing shm_unlink... " >&6; }
+if test "${ac_cv_search_shm_unlink+set}" = set; then
+ $as_echo_n "(cached) " >&6
+else
+ ac_func_search_save_LIBS=$LIBS
+cat >conftest.$ac_ext <<_ACEOF
+/* confdefs.h. */
+_ACEOF
+cat confdefs.h >>conftest.$ac_ext
+cat >>conftest.$ac_ext <<_ACEOF
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char shm_unlink ();
+int
+main ()
+{
+return shm_unlink ();
+ ;
+ return 0;
+}
+_ACEOF
+for ac_lib in '' rt; do
+ if test -z "$ac_lib"; then
+ ac_res="none required"
+ else
+ ac_res=-l$ac_lib
+ LIBS="-l$ac_lib $ac_func_search_save_LIBS"
+ fi
+ rm -f conftest.$ac_objext conftest$ac_exeext
+if { (ac_try="$ac_link"
+case "(($ac_try" in
+ *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
+ *) ac_try_echo=$ac_try;;
+esac
+eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\""
+$as_echo "$ac_try_echo") >&5
+ (eval "$ac_link") 2>conftest.er1
+ ac_status=$?
+ grep -v '^ *+' conftest.er1 >conftest.err
+ rm -f conftest.er1
+ cat conftest.err >&5
+ $as_echo "$as_me:$LINENO: \$? = $ac_status" >&5
+ (exit $ac_status); } && {
+ test -z "$ac_c_werror_flag" ||
+ test ! -s conftest.err
+ } && test -s conftest$ac_exeext && {
+ test "$cross_compiling" = yes ||
+ $as_test_x conftest$ac_exeext
+ }; then
+ ac_cv_search_shm_unlink=$ac_res
+else
+ $as_echo "$as_me: failed program was:" >&5
+sed 's/^/| /' conftest.$ac_ext >&5
+
+
+fi
+
+rm -rf conftest.dSYM
+rm -f core conftest.err conftest.$ac_objext conftest_ipa8_conftest.oo \
+ conftest$ac_exeext
+ if test "${ac_cv_search_shm_unlink+set}" = set; then
+ break
+fi
+done
+if test "${ac_cv_search_shm_unlink+set}" = set; then
+ :
+else
+ ac_cv_search_shm_unlink=no
+fi
+rm conftest.$ac_ext
+LIBS=$ac_func_search_save_LIBS
+fi
+{ $as_echo "$as_me:$LINENO: result: $ac_cv_search_shm_unlink" >&5
+$as_echo "$ac_cv_search_shm_unlink" >&6; }
+ac_res=$ac_cv_search_shm_unlink
+if test "$ac_res" != no; then
+ test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
+
+fi
+
# Solaris:
{ $as_echo "$as_me:$LINENO: checking for library containing fdatasync" >&5
$as_echo_n "checking for library containing fdatasync... " >&6; }
@@ -19764,7 +19938,8 @@ LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat readlink setproctitle setsid sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
+
+for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
do
as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
{ $as_echo "$as_me:$LINENO: checking for $ac_func" >&5
diff --git a/configure.in b/configure.in
index fe25419..415c4e3 100644
--- a/configure.in
+++ b/configure.in
@@ -883,6 +883,8 @@ case $host_os in
esac
AC_SEARCH_LIBS(getopt_long, [getopt gnugetopt])
AC_SEARCH_LIBS(crypt, crypt)
+AC_SEARCH_LIBS(shm_open, rt)
+AC_SEARCH_LIBS(shm_unlink, rt)
# Solaris:
AC_SEARCH_LIBS(fdatasync, [rt posix4])
# Required for thread_test.c on Solaris 2.5:
@@ -1230,7 +1232,7 @@ PGAC_FUNC_GETTIMEOFDAY_1ARG
LIBS_including_readline="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat readlink setproctitle setsid sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
+AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
AC_REPLACE_FUNCS(fseeko)
case $host_os in
diff --git a/contrib/dsm_demo/Makefile b/contrib/dsm_demo/Makefile
new file mode 100644
index 0000000..dd9ea92
--- /dev/null
+++ b/contrib/dsm_demo/Makefile
@@ -0,0 +1,17 @@
+# contrib/dsm_demo/Makefile
+
+MODULES = dsm_demo
+
+EXTENSION = dsm_demo
+DATA = dsm_demo--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/dsm_demo
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/dsm_demo/dsm_demo--1.0.sql b/contrib/dsm_demo/dsm_demo--1.0.sql
new file mode 100644
index 0000000..7ad6ab1
--- /dev/null
+++ b/contrib/dsm_demo/dsm_demo--1.0.sql
@@ -0,0 +1,14 @@
+/* contrib/dsm_demo/dsm_demo--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION dsm_demo" to load this file. \quit
+
+CREATE FUNCTION dsm_demo_create(pg_catalog.text)
+RETURNS pg_catalog.int8 STRICT
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION dsm_demo_read(pg_catalog.int8)
+RETURNS pg_catalog.text STRICT
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
diff --git a/contrib/dsm_demo/dsm_demo.c b/contrib/dsm_demo/dsm_demo.c
new file mode 100644
index 0000000..7f45ccb
--- /dev/null
+++ b/contrib/dsm_demo/dsm_demo.c
@@ -0,0 +1,97 @@
+/* -------------------------------------------------------------------------
+ *
+ * dsm_demo.c
+ * Dynamic shared memory demonstration.
+ *
+ * Copyright (C) 2013, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/dsm_demo/dsm_demo.c
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/dsm.h"
+#include "fmgr.h"
+
+PG_MODULE_MAGIC;
+
+void _PG_init(void);
+Datum dsm_demo_create(PG_FUNCTION_ARGS);
+Datum dsm_demo_read(PG_FUNCTION_ARGS);
+
+PG_FUNCTION_INFO_V1(dsm_demo_create);
+PG_FUNCTION_INFO_V1(dsm_demo_read);
+
+#define DSM_DEMO_MAGIC 0x44454D4F
+
+typedef struct
+{
+ uint32 magic;
+ int32 len;
+ char data[FLEXIBLE_ARRAY_MEMBER];
+} dsm_demo_payload;
+
+Datum
+dsm_demo_create(PG_FUNCTION_ARGS)
+{
+ text *txt = PG_GETARG_TEXT_PP(0);
+ int len = VARSIZE_ANY(txt);
+ uint64 seglen;
+ dsm_segment *seg;
+ dsm_demo_payload *payload;
+
+ seglen = offsetof(dsm_demo_payload, data) + len;
+ seg = dsm_create(seglen, NULL);
+ dsm_keep_mapping(seg);
+
+ payload = dsm_segment_address(seg);
+ payload->magic = DSM_DEMO_MAGIC;
+ payload->len = len;
+ memcpy(payload->data, txt, len);
+
+ PG_RETURN_INT64(dsm_segment_handle(seg));
+}
+
+Datum
+dsm_demo_read(PG_FUNCTION_ARGS)
+{
+ dsm_handle h = PG_GETARG_INT64(0);
+ dsm_segment *seg;
+ bool needs_detach = false;
+ text *txt = NULL;
+ dsm_demo_payload *payload;
+
+ /*
+	 * We could be called from the same session that called dsm_demo_create(),
+ * so search for an existing mapping. If we don't find one, attach the
+ * segment.
+ */
+ seg = dsm_find_mapping(h);
+ if (seg == NULL)
+ {
+ seg = dsm_attach(h, NULL);
+ if (!seg)
+ PG_RETURN_NULL();
+ needs_detach = true;
+ }
+
+ /* Extract data, after checking magic number. */
+ payload = dsm_segment_address(seg);
+ if (payload->magic == DSM_DEMO_MAGIC)
+ {
+ txt = palloc(payload->len);
+ memcpy(txt, payload->data, payload->len);
+ }
+
+ /* Detach, if there was no existing mapping. */
+ if (needs_detach)
+ dsm_detach(seg);
+
+ if (txt == NULL)
+ PG_RETURN_NULL();
+
+ PG_RETURN_TEXT_P(txt);
+}
diff --git a/contrib/dsm_demo/dsm_demo.control b/contrib/dsm_demo/dsm_demo.control
new file mode 100644
index 0000000..4060791
--- /dev/null
+++ b/contrib/dsm_demo/dsm_demo.control
@@ -0,0 +1,5 @@
+# dsm_demo extension
+comment = 'Dynamic shared memory demonstration'
+default_version = '1.0'
+module_pathname = 'dsm_demo'
+relocatable = true
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 20e3c32..b604407 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -29,6 +29,7 @@
#endif
#include "miscadmin.h"
+#include "portability/mem.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
@@ -36,31 +37,6 @@
typedef key_t IpcMemoryKey; /* shared memory key passed to shmget(2) */
typedef int IpcMemoryId; /* shared memory ID returned by shmget(2) */
-#define IPCProtection (0600) /* access/modify by user only */
-
-#ifdef SHM_SHARE_MMU /* use intimate shared memory on Solaris */
-#define PG_SHMAT_FLAGS SHM_SHARE_MMU
-#else
-#define PG_SHMAT_FLAGS 0
-#endif
-
-/* Linux prefers MAP_ANONYMOUS, but the flag is called MAP_ANON on other systems. */
-#ifndef MAP_ANONYMOUS
-#define MAP_ANONYMOUS MAP_ANON
-#endif
-
-/* BSD-derived systems have MAP_HASSEMAPHORE, but it's not present (or needed) on Linux. */
-#ifndef MAP_HASSEMAPHORE
-#define MAP_HASSEMAPHORE 0
-#endif
-
-#define PG_MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
-
-/* Some really old systems don't define MAP_FAILED. */
-#ifndef MAP_FAILED
-#define MAP_FAILED ((void *) -1)
-#endif
-
unsigned long UsedShmemSegID = 0;
void *UsedShmemSegAddr = NULL;
diff --git a/src/backend/storage/ipc/Makefile b/src/backend/storage/ipc/Makefile
index 743f30e..873dd60 100644
--- a/src/backend/storage/ipc/Makefile
+++ b/src/backend/storage/ipc/Makefile
@@ -15,7 +15,7 @@ override CFLAGS+= -fno-inline
endif
endif
-OBJS = ipc.o ipci.o pmsignal.o procarray.o procsignal.o shmem.o shmqueue.o \
- sinval.o sinvaladt.o standby.o
+OBJS = dsm_impl.o dsm.o ipc.o ipci.o pmsignal.o procarray.o procsignal.o \
+ shmem.o shmqueue.o sinval.o sinvaladt.o standby.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
new file mode 100644
index 0000000..3b0bc54
--- /dev/null
+++ b/src/backend/storage/ipc/dsm.c
@@ -0,0 +1,937 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsm.c
+ * manage dynamic shared memory segments
+ *
+ * This file provides a set of services to make programming with dynamic
+ * shared memory segments more convenient. Unlike the low-level
+ * facilities provided by dsm_impl.h and dsm_impl.c, mappings and segments
+ * created using this module will be cleaned up automatically. Mappings
+ * will be removed when the resource owner under which they were created
+ * is cleaned up, unless dsm_keep_mapping() is used, in which case they
+ * have session lifespan. Segments will be removed when there are no
+ * remaining mappings, or at postmaster shutdown in any case. After a
+ * hard postmaster crash, remaining segments will be removed, if they
+ * still exist, at the next postmaster startup.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/ipc/dsm.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <fcntl.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+
+#include "lib/ilist.h"
+#include "miscadmin.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/resowner_private.h"
+
+#define PG_DYNSHMEM_STATE_FILE PG_DYNSHMEM_DIR "/state"
+#define PG_DYNSHMEM_STATE_BUFSIZ 512
+#define PG_DYNSHMEM_CONTROL_MAGIC 0x9a503d32
+
+/*
+ * There's no point in getting too cheap here, because the minimum allocation
+ * is one OS page, which is probably at least 4KB and could easily be as high
+ * as 64KB.  Each slot costs sizeof(dsm_control_item), currently 8 bytes.
+ */
+#define PG_DYNSHMEM_FIXED_SLOTS 64
+#define PG_DYNSHMEM_SLOTS_PER_BACKEND 2
+
+#define INVALID_CONTROL_SLOT ((uint32) -1)
+
+struct dsm_segment
+{
+ dlist_node node; /* List link in dsm_segment_list. */
+ ResourceOwner resowner; /* Resource owner. */
+ dsm_handle handle; /* Segment name. */
+ uint32 control_slot; /* Slot in control segment. */
+ void *mapped_address; /* Mapping address, or NULL if unmapped. */
+ uint64 mapped_size; /* Size of our mapping. */
+};
+
+typedef struct dsm_control_item
+{
+ dsm_handle handle;
+ uint32 refcnt; /* 2+ = active, 1 = moribund, 0 = gone */
+} dsm_control_item;
+
+typedef struct dsm_control_header
+{
+ uint32 magic;
+ uint32 nitems;
+ uint32 maxitems;
+ dsm_control_item item[FLEXIBLE_ARRAY_MEMBER];
+} dsm_control_header;
+
+static void dsm_cleanup_using_control_segment(void);
+static void dsm_cleanup_for_mmap(void);
+static bool dsm_read_state_file(dsm_handle *h);
+static void dsm_write_state_file(dsm_handle h);
+static void dsm_postmaster_shutdown(int code, Datum arg);
+static void dsm_backend_shutdown(int code, Datum arg);
+static dsm_segment *dsm_create_descriptor(void);
+static bool dsm_control_segment_sane(dsm_control_header *control,
+ uint64 mapped_size);
+static uint64 dsm_control_bytes_needed(uint32 nitems);
+
+/* Has this backend initialized the dynamic shared memory system yet? */
+static bool dsm_init_done = false;
+
+/*
+ * List of dynamic shared memory segments used by this backend.
+ *
+ * At process exit time, we must decrement the reference count of each
+ * segment we have attached; this list makes it possible to find all such
+ * segments.
+ *
+ * This list should always be empty in the postmaster. We could probably
+ * allow the postmaster to map dynamic shared memory segments before it
+ * begins to start child processes, provided that each process adjusted
+ * the reference counts for those segments in the control segment at
+ * startup time, but there's no obvious need for such a facility, which
+ * would also be complex to handle in the EXEC_BACKEND case. Once the
+ * postmaster has begun spawning children, there's an additional problem:
+ * each new mapping would require an update to the control segment,
+ * which requires locking, in which the postmaster must not be involved.
+ */
+static dlist_head dsm_segment_list = DLIST_STATIC_INIT(dsm_segment_list);
+
+/*
+ * Control segment information.
+ *
+ * Unlike ordinary shared memory segments, the control segment is not
+ * reference counted; instead, it lasts for the postmaster's entire
+ * life cycle. For simplicity, it doesn't have a dsm_segment object either.
+ */
+static dsm_handle dsm_control_handle;
+static dsm_control_header *dsm_control;
+static uint64 dsm_control_mapped_size = 0;
+
+/*
+ * Start up the dynamic shared memory system.
+ *
+ * This is called just once during each cluster lifetime, at postmaster
+ * startup time.
+ */
+void
+dsm_postmaster_startup(void)
+{
+ void *dsm_control_address = NULL;
+ uint32 maxitems;
+ uint64 segsize;
+
+ Assert(!IsUnderPostmaster);
+
+ /* If dynamic shared memory is disabled, there's nothing to do. */
+ if (dynamic_shared_memory_type == DSM_IMPL_NONE)
+ return;
+
+ /*
+ * Check for, and remove, shared memory segments left behind by a dead
+ * postmaster.
+ */
+ if (dynamic_shared_memory_type == DSM_IMPL_MMAP)
+ dsm_cleanup_for_mmap();
+ else
+ dsm_cleanup_using_control_segment();
+
+ /* Determine size for new control segment. */
+ maxitems = PG_DYNSHMEM_FIXED_SLOTS
+ + PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;
+ elog(DEBUG2, "dynamic shared memory system will support %lu segments",
+ (unsigned long) maxitems);
+ segsize = dsm_control_bytes_needed(maxitems);
+
+ /* Create new control segment. */
+ for (;;)
+ {
+ Assert(dsm_control_address == NULL);
+ Assert(dsm_control_mapped_size == 0);
+ dsm_control_handle = random();
+ if (dsm_impl_op(DSM_OP_CREATE, dsm_control_handle, segsize, NULL,
+ &dsm_control_address, &dsm_control_mapped_size, ERROR))
+ break;
+ }
+ dsm_control = dsm_control_address;
+ on_shmem_exit(dsm_postmaster_shutdown, 0);
+ elog(DEBUG2, "created dynamic shared memory control segment %lu ("
+ UINT64_FORMAT " bytes)", (unsigned long) dsm_control_handle,
+ segsize);
+ dsm_write_state_file(dsm_control_handle);
+
+ /* Initialize control segment. */
+ dsm_control->magic = PG_DYNSHMEM_CONTROL_MAGIC;
+ dsm_control->nitems = 0;
+ dsm_control->maxitems = maxitems;
+}
+
+/*
+ * Determine whether the control segment from the previous postmaster
+ * invocation still exists. If so, remove the dynamic shared memory
+ * segments to which it refers, and then the control segment itself.
+ */
+static void
+dsm_cleanup_using_control_segment(void)
+{
+ void *mapped_address = NULL;
+ void *junk_mapped_address = NULL;
+ uint64 mapped_size = 0;
+ uint64 junk_mapped_size = 0;
+ uint32 nitems;
+ uint32 i;
+ dsm_handle old_control_handle;
+ dsm_control_header *old_control;
+
+ /*
+ * Read the state file. If it doesn't exist or is empty, there's nothing
+ * more to do.
+ */
+ if (!dsm_read_state_file(&old_control_handle))
+ return;
+
+ /*
+ * Try to attach the segment. If this fails, it probably just means that
+ * the operating system has been rebooted and the segment no longer exists,
+ * or an unrelated process has used the same shm ID.  So just fall out
+ * quietly.
+ */
+ if (!dsm_impl_op(DSM_OP_ATTACH, old_control_handle, 0, NULL,
+ &mapped_address, &mapped_size, DEBUG1))
+ return;
+
+ /*
+ * We've managed to reattach it, but the contents might not be sane.
+ * If they aren't, we disregard the segment after all.
+ */
+ old_control = (dsm_control_header *) mapped_address;
+ if (!dsm_control_segment_sane(old_control, mapped_size))
+ {
+ dsm_impl_op(DSM_OP_DETACH, old_control_handle, 0, NULL,
+ &mapped_address, &mapped_size, LOG);
+ return;
+ }
+
+ /*
+ * OK, the control segment looks basically valid, so we can use it
+ * to get a list of segments that need to be removed.
+ */
+ nitems = old_control->nitems;
+ for (i = 0; i < nitems; ++i)
+ {
+ dsm_handle handle;
+
+ /* If the reference count is 0, the slot is actually unused. */
+ if (old_control->item[i].refcnt == 0)
+ continue;
+
+ /* Log debugging information. */
+ handle = old_control->item[i].handle;
+ elog(DEBUG2, "cleaning up orphaned dynamic shared memory with ID %lu",
+ (unsigned long) handle);
+
+ /* Destroy the referenced segment. */
+ dsm_impl_op(DSM_OP_DESTROY, handle, 0, NULL,
+ &junk_mapped_address, &junk_mapped_size, LOG);
+ }
+
+ /* Destroy the old control segment, too. */
+ elog(DEBUG2,
+ "cleaning up dynamic shared memory control segment with ID %lu",
+ (unsigned long) old_control_handle);
+ dsm_impl_op(DSM_OP_DESTROY, old_control_handle, 0, NULL,
+ &mapped_address, &mapped_size, LOG);
+}
+
+/*
+ * When we're using the mmap shared memory implementation, "shared memory"
+ * segments might even manage to survive an operating system reboot.
+ * But there's no guarantee as to exactly what will survive: some segments
+ * may survive, and others may not, and the contents of some may be out
+ * of date. In particular, the control segment may be out of date, so we
+ * can't rely on it to figure out what to remove. However, since we know
+ * what directory contains the files we used as shared memory, we can simply
+ * scan the directory and blow everything away that shouldn't be there.
+ */
+static void
+dsm_cleanup_for_mmap(void)
+{
+ DIR *dir;
+ struct dirent *dent;
+
+ /* Open the directory; can't use AllocateDir in postmaster. */
+ if ((dir = opendir(PG_DYNSHMEM_DIR)) == NULL)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m",
+ PG_DYNSHMEM_DIR)));
+
+ /* Scan for something with a name of the correct format. */
+ while ((dent = readdir(dir)) != NULL)
+ {
+ if (strncmp(dent->d_name, PG_DYNSHMEM_MMAP_FILE_PREFIX,
+ strlen(PG_DYNSHMEM_MMAP_FILE_PREFIX)) == 0)
+ {
+ char buf[MAXPGPATH];
+ snprintf(buf, MAXPGPATH, PG_DYNSHMEM_DIR "/%s", dent->d_name);
+
+ elog(DEBUG2, "removing file \"%s\"", buf);
+
+ /* We found a matching file; so remove it. */
+ if (unlink(buf) != 0)
+ {
+ int save_errno;
+
+ save_errno = errno;
+ closedir(dir);
+ errno = save_errno;
+
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", buf)));
+ }
+ }
+ }
+
+ /* Cleanup complete. */
+ closedir(dir);
+}
+
+/*
+ * Read and parse the state file.
+ *
+ * If the state file is empty or the contents are garbled, it probably means
+ * that the operating system rebooted before the data written by the previous
+ * postmaster made it to disk. In that case, we can just ignore it; any shared
+ * memory from before the reboot should be gone anyway.
+ */
+static bool
+dsm_read_state_file(dsm_handle *h)
+{
+ int statefd;
+ char statebuf[PG_DYNSHMEM_STATE_BUFSIZ];
+ int nbytes = 0;
+ char *endptr,
+ *s;
+ dsm_handle handle;
+
+ /* Read the state file to get the ID of the old control segment. */
+ statefd = open(PG_DYNSHMEM_STATE_FILE, O_RDONLY);
+ if (statefd < 0)
+ {
+ if (errno == ENOENT)
+ return false;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m",
+ PG_DYNSHMEM_STATE_FILE)));
+ }
+ nbytes = read(statefd, statebuf, PG_DYNSHMEM_STATE_BUFSIZ - 1);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ PG_DYNSHMEM_STATE_FILE)));
+ /* make sure buffer is NUL terminated */
+ statebuf[nbytes] = '\0';
+ close(statefd);
+
+ /*
+ * We expect to find the handle of the old control segment here,
+ * on a line by itself.
+ */
+ handle = strtoul(statebuf, &endptr, 10);
+ if (endptr == statebuf)
+ return false; /* file was empty or did not start with a number */
+ for (s = endptr; *s == ' ' || *s == '\t'; ++s)
+ ;
+ if (*s != '\n' && *s != '\0')
+ return false;
+
+ /* Looks good. */
+ *h = handle;
+ return true;
+}
+
+/*
+ * Write our control segment handle to the state file, so that if the
+ * postmaster is killed without running its on_shmem_exit hooks, the
+ * next postmaster can clean things up after restart.
+ */
+static void
+dsm_write_state_file(dsm_handle h)
+{
+ int statefd;
+ char statebuf[PG_DYNSHMEM_STATE_BUFSIZ];
+ int nbytes;
+
+ /* Create or truncate the file. */
+ statefd = open(PG_DYNSHMEM_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);
+ if (statefd < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m",
+ PG_DYNSHMEM_STATE_FILE)));
+
+ /* Write contents. */
+ snprintf(statebuf, PG_DYNSHMEM_STATE_BUFSIZ, "%lu\n",
+ (unsigned long) dsm_control_handle);
+ nbytes = strlen(statebuf);
+ if (write(statefd, statebuf, nbytes) != nbytes)
+ {
+ if (errno == 0)
+ errno = ENOSPC; /* if no error signalled, assume no space */
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ PG_DYNSHMEM_STATE_FILE)));
+ }
+
+ /* Close file. */
+ close(statefd);
+}
+
+/*
+ * At shutdown time, we iterate over the control segment and remove all
+ * remaining dynamic shared memory segments. We avoid throwing errors here;
+ * the postmaster is shutting down either way, and this is just non-critical
+ * resource cleanup.
+ */
+static void
+dsm_postmaster_shutdown(int code, Datum arg)
+{
+ uint32 nitems;
+ uint32 i;
+ void *dsm_control_address;
+ void *junk_mapped_address = NULL;
+ uint64 junk_mapped_size = 0;
+
+ /*
+ * If some other backend exited uncleanly, it might have corrupted the
+ * control segment while it was dying. In that case, we warn and ignore
+ * the contents of the control segment. This may end up leaving behind
+ * stray shared memory segments, but there's not much we can do about
+ * that if the metadata is gone.
+ */
+ if (!dsm_control_segment_sane(dsm_control, dsm_control_mapped_size))
+ {
+ ereport(LOG,
+ (errmsg("dynamic shared memory control segment is corrupt")));
+ return;
+ }
+ nitems = dsm_control->nitems;
+
+ /* Remove any remaining segments. */
+ for (i = 0; i < nitems; ++i)
+ {
+ dsm_handle handle;
+
+ /* If the reference count is 0, the slot is actually unused. */
+ if (dsm_control->item[i].refcnt == 0)
+ continue;
+
+ /* Log debugging information. */
+ handle = dsm_control->item[i].handle;
+ elog(DEBUG2, "cleaning up orphaned dynamic shared memory with ID %lu",
+ (unsigned long) handle);
+
+ /* Destroy the segment. */
+ dsm_impl_op(DSM_OP_DESTROY, handle, 0, NULL,
+ &junk_mapped_address, &junk_mapped_size, LOG);
+ }
+
+ /* Remove the control segment itself. */
+ elog(DEBUG2,
+ "cleaning up dynamic shared memory control segment with ID %lu",
+ (unsigned long) dsm_control_handle);
+ dsm_control_address = dsm_control;
+ dsm_impl_op(DSM_OP_DESTROY, dsm_control_handle, 0, NULL,
+ &dsm_control_address, &dsm_control_mapped_size, LOG);
+ dsm_control = dsm_control_address;
+
+ /* And, finally, remove the state file. */
+ if (unlink(PG_DYNSHMEM_STATE_FILE) < 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not unlink file \"%s\": %m",
+ PG_DYNSHMEM_STATE_FILE)));
+}
+
+/*
+ * Prepare this backend for dynamic shared memory usage. Under EXEC_BACKEND,
+ * we must reread the state file and map the control segment; in other cases,
+ * we'll have inherited the postmaster's mapping and global variables.
+ */
+static void
+dsm_backend_startup(void)
+{
+ /* If dynamic shared memory is disabled, reject this. */
+ if (dynamic_shared_memory_type == DSM_IMPL_NONE)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("dynamic shared memory is disabled"),
+ errhint("Set dynamic_shared_memory_type to a value other than \"none\".")));
+
+#ifdef EXEC_BACKEND
+ {
+ dsm_handle control_handle;
+ void *control_address = NULL;
+
+ /* Read the control segment information from the state file. */
+ if (!dsm_read_state_file(&control_handle))
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("could not parse dynamic shared memory state file")));
+
+ /* Attach control segment. */
+ dsm_impl_op(DSM_OP_ATTACH, control_handle, 0,
+ NULL, &control_address, &dsm_control_mapped_size, ERROR);
+ dsm_control_handle = control_handle;
+ dsm_control = control_address;
+
+ /* If control segment doesn't look sane, something is badly wrong. */
+ if (!dsm_control_segment_sane(dsm_control, dsm_control_mapped_size))
+ {
+ dsm_impl_op(DSM_OP_DETACH, control_handle, 0,
+ NULL, &control_address, &dsm_control_mapped_size,
+ WARNING);
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("dynamic shared memory control segment is not valid")));
+ }
+ }
+#endif
+
+ /* Arrange to detach segments on exit. */
+ on_shmem_exit(dsm_backend_shutdown, 0);
+
+ dsm_init_done = true;
+}
+
+/*
+ * Create a new dynamic shared memory segment.
+ */
+dsm_segment *
+dsm_create(uint64 size, char *preferred_address)
+{
+ dsm_segment *seg = dsm_create_descriptor();
+ uint32 i;
+ uint32 nitems;
+
+ /* Unsafe in postmaster (and pointless in a stand-alone backend). */
+ Assert(IsUnderPostmaster);
+
+ if (!dsm_init_done)
+ dsm_backend_startup();
+
+ /* Loop until we find an unused segment identifier. */
+ for (;;)
+ {
+ Assert(seg->mapped_address == NULL && seg->mapped_size == 0);
+ seg->handle = random();
+ if (dsm_impl_op(DSM_OP_CREATE, seg->handle, size, preferred_address,
+ &seg->mapped_address, &seg->mapped_size, ERROR))
+ break;
+ }
+
+ /* Lock the control segment so we can register the new segment. */
+ LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
+
+ /* Search the control segment for an unused slot. */
+ nitems = dsm_control->nitems;
+ for (i = 0; i < nitems; ++i)
+ {
+ if (dsm_control->item[i].refcnt == 0)
+ {
+ dsm_control->item[i].handle = seg->handle;
+ /* refcnt of 1 triggers destruction, so start at 2 */
+ dsm_control->item[i].refcnt = 2;
+ seg->control_slot = i;
+ LWLockRelease(DynamicSharedMemoryControlLock);
+ return seg;
+ }
+ }
+
+ /* Verify that we can support an additional mapping. */
+ if (nitems >= dsm_control->maxitems)
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+ errmsg("too many dynamic shared memory segments")));
+
+ /* Enter the handle into a new array slot. */
+ dsm_control->item[nitems].handle = seg->handle;
+ /* refcnt of 1 triggers destruction, so start at 2 */
+ dsm_control->item[nitems].refcnt = 2;
+ seg->control_slot = nitems;
+ dsm_control->nitems++;
+ LWLockRelease(DynamicSharedMemoryControlLock);
+
+ return seg;
+}
+
+/*
+ * Attach a dynamic shared memory segment.
+ *
+ * See comments for dsm_segment_handle() for an explanation of how this
+ * is intended to be used.
+ *
+ * This function will return NULL if the segment isn't known to the system.
+ * This can happen if we're asked to attach the segment, but then everyone
+ * else detaches it (causing it to be destroyed) before we get around to
+ * attaching it.
+ */
+dsm_segment *
+dsm_attach(dsm_handle h, char *preferred_address)
+{
+ dsm_segment *seg;
+ dlist_iter iter;
+ uint32 i;
+ uint32 nitems;
+
+ /* Unsafe in postmaster (and pointless in a stand-alone backend). */
+ Assert(IsUnderPostmaster);
+
+ if (!dsm_init_done)
+ dsm_backend_startup();
+
+ /*
+ * Since this is just a debugging cross-check, we could leave it out
+ * altogether, or include it only in assert-enabled builds. But since
+ * the list of attached segments should normally be very short, let's
+ * always include it for now.
+ *
+ * If you're hitting this error, you probably want to attempt to
+ * find an existing mapping via dsm_find_mapping() before calling
+ * dsm_attach() to create a new one.
+ */
+ dlist_foreach(iter, &dsm_segment_list)
+ {
+ seg = dlist_container(dsm_segment, node, iter.cur);
+ if (seg->handle == h)
+ elog(ERROR, "can't attach the same segment more than once");
+ }
+
+ /* Create a new segment descriptor. */
+ seg = dsm_create_descriptor();
+ seg->handle = h;
+
+ /* Bump reference count for this segment in shared memory. */
+ LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
+ nitems = dsm_control->nitems;
+ for (i = 0; i < nitems; ++i)
+ {
+ /* If the reference count is 0, the slot is actually unused. */
+ if (dsm_control->item[i].refcnt == 0)
+ continue;
+
+ /*
+ * If the reference count is 1, the slot is still in use, but the
+ * segment is in the process of going away. Treat that as if we
+ * didn't find a match.
+ */
+ if (dsm_control->item[i].refcnt == 1)
+ break;
+
+ /* Otherwise, if the descriptor matches, we've found a match. */
+ if (dsm_control->item[i].handle == seg->handle)
+ {
+ dsm_control->item[i].refcnt++;
+ seg->control_slot = i;
+ break;
+ }
+ }
+ LWLockRelease(DynamicSharedMemoryControlLock);
+
+ /*
+ * If we didn't find the handle we're looking for in the control
+ * segment, it probably means that everyone else who had it mapped,
+ * including the original creator, died before we got to this point.
+ * It's up to the caller to decide what to do about that.
+ */
+ if (seg->control_slot == INVALID_CONTROL_SLOT)
+ {
+ dsm_detach(seg);
+ return NULL;
+ }
+
+ /* Here's where we actually try to map the segment. */
+ dsm_impl_op(DSM_OP_ATTACH, seg->handle, 0, preferred_address,
+ &seg->mapped_address, &seg->mapped_size, ERROR);
+
+ return seg;
+}
+
+/*
+ * At backend shutdown time, detach any segments that are still attached.
+ */
+static void
+dsm_backend_shutdown(int code, Datum arg)
+{
+ while (!dlist_is_empty(&dsm_segment_list))
+ {
+ dsm_segment *seg;
+
+ seg = dlist_head_element(dsm_segment, node, &dsm_segment_list);
+ dsm_detach(seg);
+ }
+}
+
+/*
+ * Resize an existing shared memory segment.
+ *
+ * This may cause the shared memory segment to be remapped at a different
+ * address. For the caller's convenience, we return the mapped address.
+ */
+void *
+dsm_resize(dsm_segment *seg, uint64 size, char *preferred_address)
+{
+ Assert(seg->control_slot != INVALID_CONTROL_SLOT);
+ dsm_impl_op(DSM_OP_RESIZE, seg->handle, size, preferred_address,
+ &seg->mapped_address, &seg->mapped_size, ERROR);
+ return seg->mapped_address;
+}
+
+/*
+ * Remap an existing shared memory segment.
+ *
+ * This is intended to be used when some other process has extended the
+ * mapping using dsm_resize(), but we've still only got the initial
+ * portion mapped. Since this might change the address at which the
+ * segment is mapped, we return the new mapped address.
+ */
+void *
+dsm_remap(dsm_segment *seg, char *preferred_address)
+{
+ if (!dsm_impl_can_resize())
+ return seg->mapped_address;
+
+ dsm_impl_op(DSM_OP_ATTACH, seg->handle, 0, preferred_address,
+ &seg->mapped_address, &seg->mapped_size, ERROR);
+
+ return seg->mapped_address;
+}
+
+/*
+ * Detach from a shared memory segment, destroying the segment if we
+ * remove the last reference.
+ *
+ * This function should never fail. It will often be invoked when aborting
+ * a transaction, and a further error won't serve any purpose. It's not a
+ * complete disaster if we fail to unmap or destroy the segment; it means a
+ * resource leak, but that doesn't necessarily preclude further operations.
+ */
+void
+dsm_detach(dsm_segment *seg)
+{
+ /*
+ * Try to remove the mapping, if one exists. Normally, there will be,
+ * but maybe not, if we failed partway through a create or attach
+ * operation. We remove the mapping before decrementing the reference
+ * count so that the process that sees a zero reference count can be
+ * certain that no remaining mappings exist. Even if this fails, we
+ * pretend that it works, because retrying is likely to fail in the
+ * same way.
+ */
+ if (seg->mapped_address != NULL)
+ {
+ dsm_impl_op(DSM_OP_DETACH, seg->handle, 0, NULL,
+ &seg->mapped_address, &seg->mapped_size, WARNING);
+ seg->mapped_address = NULL;
+ seg->mapped_size = 0;
+ }
+
+ /* Reduce reference count, if we previously increased it. */
+ if (seg->control_slot != INVALID_CONTROL_SLOT)
+ {
+ uint32 refcnt;
+ uint32 control_slot = seg->control_slot;
+
+ LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
+ Assert(dsm_control->item[control_slot].handle == seg->handle);
+ Assert(dsm_control->item[control_slot].refcnt > 1);
+ refcnt = --dsm_control->item[control_slot].refcnt;
+ seg->control_slot = INVALID_CONTROL_SLOT;
+ LWLockRelease(DynamicSharedMemoryControlLock);
+
+ /* If new reference count is 1, try to destroy the segment. */
+ if (refcnt == 1)
+ {
+ /*
+ * If we fail to destroy the segment here, or are killed before
+ * we finish doing so, the reference count will remain at 1, which
+ * will mean that nobody else can attach to the segment. At
+ * postmaster shutdown time, or when a new postmaster is started
+ * after a hard kill, another attempt will be made to remove the
+ * segment.
+ *
+ * The main case we're worried about here is being killed by
+ * a signal before we can finish removing the segment. In that
+ * case, it's important to be sure that the segment still gets
+ * removed. If we actually fail to remove the segment for some
+ * other reason, the postmaster may not have any better luck than
+ * we did. There's not much we can do about that, though.
+ */
+ if (dsm_impl_op(DSM_OP_DESTROY, seg->handle, 0, NULL,
+ &seg->mapped_address, &seg->mapped_size, WARNING))
+ {
+ LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
+ Assert(dsm_control->item[control_slot].handle == seg->handle);
+ Assert(dsm_control->item[control_slot].refcnt == 1);
+ dsm_control->item[control_slot].refcnt = 0;
+ LWLockRelease(DynamicSharedMemoryControlLock);
+ }
+ }
+ }
+
+ /* Clean up our remaining backend-private data structures. */
+ if (seg->resowner != NULL)
+ ResourceOwnerForgetDSM(seg->resowner, seg);
+ dlist_delete(&seg->node);
+ pfree(seg);
+}
+
+/*
+ * Keep a dynamic shared memory mapping until end of session.
+ *
+ * By default, mappings are owned by the current resource owner, which
+ * typically means they stick around for the duration of the current query
+ * only.
+ */
+void
+dsm_keep_mapping(dsm_segment *seg)
+{
+ if (seg->resowner != NULL)
+ {
+ ResourceOwnerForgetDSM(seg->resowner, seg);
+ seg->resowner = NULL;
+ }
+}
+
+/*
+ * Find an existing mapping for a shared memory segment, if there is one.
+ */
+dsm_segment *
+dsm_find_mapping(dsm_handle h)
+{
+ dlist_iter iter;
+ dsm_segment *seg;
+
+ dlist_foreach(iter, &dsm_segment_list)
+ {
+ seg = dlist_container(dsm_segment, node, iter.cur);
+ if (seg->handle == h)
+ return seg;
+ }
+
+ return NULL;
+}
+
+/*
+ * Get the address at which a dynamic shared memory segment is mapped.
+ */
+void *
+dsm_segment_address(dsm_segment *seg)
+{
+ Assert(seg->mapped_address != NULL);
+ return seg->mapped_address;
+}
+
+/*
+ * Get the size of a mapping.
+ */
+uint64
+dsm_segment_map_length(dsm_segment *seg)
+{
+ Assert(seg->mapped_address != NULL);
+ return seg->mapped_size;
+}
+
+/*
+ * Get a handle for a mapping.
+ *
+ * To establish communication via dynamic shared memory between two backends,
+ * one of them should first call dsm_create() to establish a new shared
+ * memory mapping. That process should then call dsm_segment_handle() to
+ * obtain a handle for the mapping, and pass that handle to the
+ * coordinating backend via some means (e.g. bgw_main_arg, or via the
+ * main shared memory segment). The recipient, once in possession of the
+ * handle, should call dsm_attach().
+ */
+dsm_handle
+dsm_segment_handle(dsm_segment *seg)
+{
+ return seg->handle;
+}
+
+/*
+ * Create a segment descriptor.
+ */
+static dsm_segment *
+dsm_create_descriptor(void)
+{
+ dsm_segment *seg;
+
+ ResourceOwnerEnlargeDSMs(CurrentResourceOwner);
+
+ seg = MemoryContextAlloc(TopMemoryContext, sizeof(dsm_segment));
+ dlist_push_head(&dsm_segment_list, &seg->node);
+
+ /* seg->handle must be initialized by the caller */
+ seg->control_slot = INVALID_CONTROL_SLOT;
+ seg->mapped_address = NULL;
+ seg->mapped_size = 0;
+
+ seg->resowner = CurrentResourceOwner;
+ ResourceOwnerRememberDSM(CurrentResourceOwner, seg);
+
+ return seg;
+}
+
+/*
+ * Sanity check a control segment.
+ *
+ * The goal here isn't to detect everything that could possibly be wrong with
+ * the control segment; there's not enough information for that. Rather, the
+ * goal is to make sure that someone can iterate over the items in the segment
+ * without overrunning the end of the mapping and crashing. We also check
+ * the magic number since, if that's messed up, this may not even be one of
+ * our segments at all.
+ */
+static bool
+dsm_control_segment_sane(dsm_control_header *control, uint64 mapped_size)
+{
+ if (mapped_size < offsetof(dsm_control_header, item))
+ return false; /* Mapped size too short to read header. */
+ if (control->magic != PG_DYNSHMEM_CONTROL_MAGIC)
+ return false; /* Magic number doesn't match. */
+ if (dsm_control_bytes_needed(control->maxitems) > mapped_size)
+ return false; /* Max item count won't fit in map. */
+ if (control->nitems > control->maxitems)
+ return false; /* Overfull. */
+ return true;
+}
+
+/*
+ * Compute the number of control-segment bytes needed to store a given
+ * number of items.
+ */
+static uint64
+dsm_control_bytes_needed(uint32 nitems)
+{
+ return offsetof(dsm_control_header, item)
+ + sizeof(dsm_control_item) * (uint64) nitems;
+}
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
new file mode 100644
index 0000000..fce66a8
--- /dev/null
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -0,0 +1,801 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsm_impl.c
+ * manage dynamic shared memory segments
+ *
+ * This file provides low-level APIs for creating and destroying shared
+ * memory segments using several different possible techniques. We refer
+ * to these segments as dynamic because they can be created, altered, and
+ * destroyed at any point during the server life cycle. This is unlike
+ * the main shared memory segment, of which there is always exactly one
+ * and which is always mapped at a fixed address in every PostgreSQL
+ * background process.
+ *
+ * Because not all systems provide the same primitives in this area, nor
+ * do all primitives behave the same way on all systems, we provide
+ * several implementations of this facility. Many systems implement
+ * POSIX shared memory (shm_open etc.), which is well-suited to our needs
+ * in this area, with the exception that shared memory identifiers live
+ * in a flat system-wide namespace, raising the uncomfortable prospect of
+ * name collisions with other processes (including other copies of
+ * PostgreSQL) running on the same system. Some systems only support
+ * the older System V shared memory interface (shmget etc.) which is
+ * also usable; however, the default allocation limits are often quite
+ * small, and the namespace is even more restricted.
+ *
+ * We also provide an mmap-based shared memory implementation. This may
+ * be useful on systems that provide shared memory via a special-purpose
+ * filesystem; by opting for this implementation, the user can even
+ * control precisely where their shared memory segments are placed. It
+ * can also be used as a fallback for systems where shm_open and shmget
+ * are not available or can't be used for some reason. Of course,
+ * mapping a file residing on an actual spinning disk is a fairly poor
+ * approximation for shared memory because writeback may hurt performance
+ * substantially, but there should be few systems where we must make do
+ * with such poor tools.
+ *
+ * As ever, Windows requires its own implementation.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/ipc/dsm_impl.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <fcntl.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#ifdef HAVE_SYS_IPC_H
+#include <sys/ipc.h>
+#endif
+#ifdef HAVE_SYS_SHM_H
+#include <sys/shm.h>
+#endif
+
+#include "portability/mem.h"
+#include "storage/dsm_impl.h"
+#include "storage/fd.h"
+#include "utils/guc.h"
+
+#ifdef USE_DSM_POSIX
+static bool dsm_impl_posix(dsm_op op, dsm_handle handle, uint64 request_size,
+ void *preferred_address, void **mapped_address,
+ uint64 *mapped_size, int elevel);
+#endif
+#ifdef USE_DSM_SYSV
+static bool dsm_impl_sysv(dsm_op op, dsm_handle handle, uint64 request_size,
+ void *preferred_address, void **mapped_address,
+ uint64 *mapped_size, int elevel);
+#endif
+#ifdef USE_DSM_MMAP
+static bool dsm_impl_mmap(dsm_op op, dsm_handle handle, uint64 request_size,
+ void *preferred_address, void **mapped_address,
+ uint64 *mapped_size, int elevel);
+#endif
+static int errcode_for_dynamic_shared_memory(void);
+
+const struct config_enum_entry dynamic_shared_memory_options[] = {
+#ifdef USE_DSM_POSIX
+ { "posix", DSM_IMPL_POSIX, false},
+#endif
+#ifdef USE_DSM_SYSV
+ { "sysv", DSM_IMPL_SYSV, false},
+#endif
+#ifdef USE_DSM_WINDOWS
+ { "windows", DSM_IMPL_WINDOWS, false},
+#endif
+#ifdef USE_DSM_MMAP
+ { "mmap", DSM_IMPL_MMAP, false},
+#endif
+ { "none", DSM_IMPL_NONE, false},
+ {NULL, 0, false}
+};
+
+/* Implementation selector. */
+int dynamic_shared_memory_type;
+
+/* Size of buffer to be used for zero-filling. */
+#define ZBUFFER_SIZE 8192
+
+/*------
+ * Perform a low-level shared memory operation in a platform-specific way,
+ * as dictated by the selected implementation. Each implementation is
+ * required to implement the following primitives.
+ *
+ * DSM_OP_CREATE. Create a segment whose size is the request_size and
+ * map it, ideally at the preferred address.
+ *
+ * DSM_OP_ATTACH. Map the segment, whose size must be the request_size,
+ * ideally at the preferred address. The segment may already be mapped; any
+ * existing mapping should be removed before creating a new one.
+ *
+ * DSM_OP_DETACH. Unmap the segment.
+ *
+ * DSM_OP_RESIZE. Resize the segment to the given request_size and
+ * remap the segment at that new size, respecting preferred_address
+ * if possible.
+ *
+ * DSM_OP_DESTROY. Unmap the segment, if it is mapped. Destroy the
+ * segment.
+ *
+ * Arguments:
+ * op: The operation to be performed.
+ * handle: The handle of an existing object, or for DSM_OP_CREATE,
+ * a new handle the caller wants created.
+ * request_size: For DSM_OP_CREATE, the requested size. For DSM_OP_RESIZE,
+ * the new size. Otherwise, 0.
+ * preferred_address: Caller's preference for where to map the segment.
+ * mapped_address: Pointer to start of current mapping; pointer to NULL
+ * if none. Updated with new mapping address.
+ * mapped_size: Pointer to size of current mapping; pointer to 0 if none.
+ * Updated with new mapped size.
+ * elevel: Level at which to log errors.
+ *
+ * Return value: true on success, false on failure. When false is returned,
+ * a message should first be logged at the specified elevel, except in the
+ * case where DSM_OP_CREATE experiences a name collision, which should
+ * silently return false.
+ *------
+ */
+bool
+dsm_impl_op(dsm_op op, dsm_handle handle, uint64 request_size,
+ void *preferred_address, void **mapped_address,
+ uint64 *mapped_size, int elevel)
+{
+ Assert(op == DSM_OP_CREATE || op == DSM_OP_RESIZE || request_size == 0);
+ Assert((op != DSM_OP_CREATE && op != DSM_OP_ATTACH) ||
+ (*mapped_address == NULL && *mapped_size == 0));
+
+ switch (dynamic_shared_memory_type)
+ {
+#ifdef USE_DSM_POSIX
+ case DSM_IMPL_POSIX:
+ return dsm_impl_posix(op, handle, request_size, preferred_address,
+ mapped_address, mapped_size, elevel);
+#endif
+#ifdef USE_DSM_SYSV
+ case DSM_IMPL_SYSV:
+ return dsm_impl_sysv(op, handle, request_size, preferred_address,
+ mapped_address, mapped_size, elevel);
+#endif
+#ifdef USE_DSM_MMAP
+ case DSM_IMPL_MMAP:
+ return dsm_impl_mmap(op, handle, request_size, preferred_address,
+ mapped_address, mapped_size, elevel);
+#endif
+ }
+ elog(ERROR, "unexpected dynamic shared memory type: %d",
+ dynamic_shared_memory_type);
+ return false; /* not reached; keeps the compiler quiet */
+}
+
+/*
+ * Does the current dynamic shared memory implementation support resizing
+ * segments? (The answer here could be platform-dependent in the future,
+ * since AIX allows shmctl(shmid, SHM_RESIZE, &buffer), though you apparently
+ * can't resize segments to anything larger than 256MB that way. For now,
+ * we keep it simple.)
+ */
+bool
+dsm_impl_can_resize(void)
+{
+ switch (dynamic_shared_memory_type)
+ {
+ case DSM_IMPL_NONE:
+ return false;
+ case DSM_IMPL_SYSV:
+ return false;
+ default:
+ return true;
+ }
+}
+
+#ifdef USE_DSM_POSIX
+/*
+ * Operating system primitives to support POSIX shared memory.
+ *
+ * POSIX shared memory segments are created and attached using shm_open()
+ * and shm_unlink(); other operations, such as sizing or mapping the
+ * segment, are performed as if the shared memory segments were files.
+ *
+ * Indeed, on some platforms, they may be implemented that way. While
+ * POSIX shared memory segments seem intended to exist in a flat namespace,
+ * some operating systems may implement them as files, even going so far
+ * as to treat a request for /xyz as a request to create a file by that name
+ * in the root directory. Users of such broken platforms should select
+ * a different shared memory implementation.
+ */
+static bool
+dsm_impl_posix(dsm_op op, dsm_handle handle, uint64 request_size,
+ void *preferred_address, void **mapped_address,
+ uint64 *mapped_size, int elevel)
+{
+ char name[64];
+ int flags;
+ int fd;
+ char *address;
+
+ snprintf(name, 64, "/PostgreSQL.%lu", (unsigned long) handle);
+
+ /* Handle teardown cases. */
+ if (op == DSM_OP_DETACH || op == DSM_OP_DESTROY)
+ {
+ if (*mapped_address != NULL
+ && munmap(*mapped_address, *mapped_size) != 0)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not unmap shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = NULL;
+ *mapped_size = 0;
+ if (op == DSM_OP_DESTROY && shm_unlink(name) != 0)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not remove shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ return true;
+ }
+
+ /*
+ * Create new segment or open an existing one for attach or resize.
+ *
+ * Even though we're not going through fd.c, we should be safe against
+ * running out of file descriptors, because of NUM_RESERVED_FDS. We're
+ * only opening one extra descriptor here, and we'll close it before
+ * returning.
+ */
+ flags = O_RDWR | (op == DSM_OP_CREATE ? O_CREAT | O_EXCL : 0);
+ if ((fd = shm_open(name, flags, 0600)) == -1)
+ {
+ if (errno != EEXIST)
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not open shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+
+ /*
+ * If we're attaching the segment, determine the current size; if we are
+ * creating or resizing the segment, set the size to the requested value.
+ */
+ if (op == DSM_OP_ATTACH)
+ {
+ struct stat st;
+
+ if (fstat(fd, &st) != 0)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ close(fd);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not stat shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ request_size = st.st_size;
+ }
+ else if (*mapped_size != request_size && ftruncate(fd, request_size))
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ close(fd);
+ if (op == DSM_OP_CREATE)
+ shm_unlink(name);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not resize shared memory segment \"%s\" to " UINT64_FORMAT " bytes: %m",
+ name, request_size)));
+ return false;
+ }
+
+ /*
+ * If we're reattaching or resizing, we must remove any existing mapping,
+ * unless we've already got the right thing mapped.
+ */
+ if (*mapped_address != NULL)
+ {
+ if (*mapped_size == request_size)
+ return true;
+ if (munmap(*mapped_address, *mapped_size) != 0)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ close(fd);
+ if (op == DSM_OP_CREATE)
+ shm_unlink(name);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not unmap shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = NULL;
+ *mapped_size = 0;
+ }
+
+ /* Map it. */
+ address = mmap(preferred_address, request_size, PROT_READ|PROT_WRITE,
+ MAP_SHARED|MAP_HASSEMAPHORE, fd, 0);
+ if (address == MAP_FAILED)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ close(fd);
+ if (op == DSM_OP_CREATE)
+ shm_unlink(name);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not map shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = address;
+ *mapped_size = request_size;
+ close(fd);
+
+ return true;
+}
+#endif
+
+#ifdef USE_DSM_SYSV
+/*
+ * Operating system primitives to support System V shared memory.
+ *
+ * System V shared memory segments are manipulated using shmget(), shmat(),
+ * shmdt(), and shmctl(). There's no portable way to resize such
+ * segments. As the default allocation limits for System V shared memory
+ * are usually quite low, the POSIX facilities may be preferable; but
+ * those are not supported everywhere.
+ */
+static bool
+dsm_impl_sysv(dsm_op op, dsm_handle handle, uint64 request_size,
+ void *preferred_address, void **mapped_address,
+ uint64 *mapped_size, int elevel)
+{
+ key_t key;
+ int ident;
+ char *address;
+ char name[64];
+
+ /* Cache for last shm ID. */
+ static key_t lastkey = IPC_PRIVATE;
+ static int lastident;
+
+ /* Resize is not supported for System V shared memory. */
+ if (op == DSM_OP_RESIZE)
+ {
+ elog(elevel, "System V shared memory segments cannot be resized");
+ return false;
+ }
+
+ /* Since resize isn't supported, reattach is a no-op. */
+ if (op == DSM_OP_ATTACH && *mapped_address != NULL)
+ return true;
+
+ /*
+ * POSIX shared memory and mmap-based shared memory identify segments
+ * with names. To avoid needless error message variation, we use the
+ * handle as the name.
+ */
+ snprintf(name, 64, "%lu", (unsigned long) handle);
+
+ /*
+ * The System V shared memory namespace is very restricted; names are
+ * of type key_t, which is expected to be some sort of integer data type,
+ * but not necessarily the same one as dsm_handle. Since we use
+ * dsm_handle to identify shared memory segments across processes, this
+ * might seem like a problem, but it's really not. If dsm_handle is
+ * bigger than key_t, the cast below might truncate away some bits from
+ the user-provided handle, but it'll truncate exactly the same bits
+ * away in exactly the same fashion every time we use that handle, which
+ * is all that really matters. Conversely, if dsm_handle is smaller than
+ * key_t, we won't use the full range of available key space, but that's
+ * no big deal either.
+ *
+ * We do make sure that the key isn't negative, because that might not
+ * be portable.
+ */
+ key = (key_t) handle;
+ if (key < 1) /* avoid compiler warning if type is unsigned */
+ key = -key;
+
+ /*
+ * There's one special key, IPC_PRIVATE, which can't be used. If we end
+ * up with that value by chance during a create operation, just pretend
+ * it already exists, so that caller will retry. If we run into it
+ * anywhere else, the caller has passed a handle that doesn't correspond
+ * to anything we ever created, which should not happen.
+ */
+ if (key == IPC_PRIVATE)
+ {
+ if (op != DSM_OP_CREATE)
+ elog(elevel, "System V shared memory key may not be IPC_PRIVATE");
+ errno = EEXIST;
+ return false;
+ }
+
+ /*
+ * Before we can do anything with a shared memory segment, we have to
+ * map the shared memory key to a shared memory identifier using shmget().
+ * To avoid repeated lookups, we maintain a single-entry cache of the
+ * last identifier we looked up. This should be enough for most cases,
+ * but we can expand it if needed.
+ *
+ * XXX. Probably we should enter identifiers into a hash table on create
+ * or attach and remove them on detach or destroy.
+ */
+ if (key == lastkey && op != DSM_OP_CREATE)
+ ident = lastident;
+ else
+ {
+ int flags = IPCProtection;
+ size_t segsize;
+
+ /*
+ * When using shmget to find an existing segment, we must pass the
+ * size as 0. Passing a non-zero size which is greater than the
+ * actual size will result in EINVAL.
+ */
+ segsize = 0;
+
+ if (op == DSM_OP_CREATE)
+ {
+ flags |= IPC_CREAT | IPC_EXCL;
+ segsize = request_size;
+ }
+
+ if ((ident = shmget(key, segsize, flags)) == -1)
+ {
+ if (errno != EEXIST)
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not get shared memory segment: %m")));
+ return false;
+ }
+ }
+
+ /* Handle teardown cases. */
+ if (op == DSM_OP_DETACH || op == DSM_OP_DESTROY)
+ {
+ if (*mapped_address != NULL && shmdt(*mapped_address) != 0)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not unmap shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = NULL;
+ *mapped_size = 0;
+ if (op == DSM_OP_DESTROY && shmctl(ident, IPC_RMID, NULL) < 0)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not remove shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ return true;
+ }
+
+ /* If we're attaching it, we must use IPC_STAT to determine the size. */
+ if (op == DSM_OP_ATTACH)
+ {
+ struct shmid_ds shm;
+
+ if (shmctl(ident, IPC_STAT, &shm) != 0)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ if (op == DSM_OP_CREATE)
+ shmctl(ident, IPC_RMID, NULL);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not stat shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ request_size = shm.shm_segsz;
+ }
+
+ /* Map it. */
+ address = shmat(ident, NULL, PG_SHMAT_FLAGS);
+ if (address == (void *) -1)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ if (op == DSM_OP_CREATE)
+ shmctl(ident, IPC_RMID, NULL);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not map shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = address;
+ *mapped_size = request_size;
+
+ return true;
+}
+#endif
+
+#ifdef USE_DSM_MMAP
+/*
+ * Operating system primitives to support mmap-based shared memory.
+ *
+ * Calling this "shared memory" is somewhat of a misnomer, because what
+ * we're really doing is creating a bunch of files and mapping them into
+ * our address space. The operating system may feel obliged to
+ * synchronize the contents to disk even if nothing is being paged out,
+ * which will not serve us well. The user can relocate the pg_dynshmem
+ * directory to a ramdisk to avoid this problem, if available.
+ */
+static bool
+dsm_impl_mmap(dsm_op op, dsm_handle handle, uint64 request_size,
+ void *preferred_address, void **mapped_address,
+ uint64 *mapped_size, int elevel)
+{
+ char name[64];
+ int flags;
+ int fd;
+ char *address;
+
+ snprintf(name, 64, PG_DYNSHMEM_DIR "/" PG_DYNSHMEM_MMAP_FILE_PREFIX "%lu",
+ (unsigned long) handle);
+
+ /* Handle teardown cases. */
+ if (op == DSM_OP_DETACH || op == DSM_OP_DESTROY)
+ {
+ if (*mapped_address != NULL
+ && munmap(*mapped_address, *mapped_size) != 0)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not unmap shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = NULL;
+ *mapped_size = 0;
+ if (op == DSM_OP_DESTROY && unlink(name) != 0)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not remove shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ return true;
+ }
+
+ /* Create new segment or open an existing one for attach or resize. */
+ flags = O_RDWR | (op == DSM_OP_CREATE ? O_CREAT | O_EXCL : 0);
+ if ((fd = OpenTransientFile(name, flags, 0600)) == -1)
+ {
+ if (errno != EEXIST)
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not open shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+
+ /*
+ * If we're attaching the segment, determine the current size; if we are
+ * creating or resizing the segment, set the size to the requested value.
+ */
+ if (op == DSM_OP_ATTACH)
+ {
+ struct stat st;
+
+ if (fstat(fd, &st) != 0)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ CloseTransientFile(fd);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not stat shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ request_size = st.st_size;
+ }
+ else if (*mapped_size > request_size && ftruncate(fd, request_size))
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ CloseTransientFile(fd);
+ if (op == DSM_OP_CREATE)
+ unlink(name);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not resize shared memory segment %s to " UINT64_FORMAT " bytes: %m",
+ name, request_size)));
+ return false;
+ }
+ else if (*mapped_size < request_size)
+ {
+ bool zero_fill = true;
+
+#ifdef HAVE_POSIX_FALLOCATE
+ /*
+ * If posix_fallocate() is available and succeeds, then the file is
+ * properly allocated and we don't need to zero-fill it (which is less
+ * efficient). In case of an error, fall back to writing zeros,
+ * because on some platforms posix_fallocate() is available but will
+ * not always succeed in cases where zero-filling will.
+ */
+ if (posix_fallocate(fd, 0, request_size) == 0)
+ zero_fill = false;
+#endif /* HAVE_POSIX_FALLOCATE */
+
+ if (zero_fill)
+ {
+ /*
+ * Allocate a buffer full of zeros.
+ *
+ * Note: palloc zbuffer, instead of just using a local char array,
+ * to ensure it is reasonably well-aligned; this may save a few
+ * cycles transferring data to the kernel.
+ */
+ char *zbuffer = (char *) palloc0(ZBUFFER_SIZE);
+ uint64 remaining = request_size;
+ bool success = true;
+
+ /*
+ * Zero-fill the file. We have to do this the hard way to ensure
+ * that all the file space has really been allocated, so that we
+ * don't later seg fault when accessing the memory mapping. This
+ * is pretty pessimal; hopefully most systems where this is used
+ * have posix_fallocate.
+ *
+ * On some systems, such as MacOS X, posix_fallocate isn't
+ * available, but ftruncate serves the same purpose. It would be
+ * nice to use that when possible, but using it to expand a file
+ * isn't portable.
+ */
+ while (success && remaining > 0)
+ {
+ uint64 goal = remaining;
+
+ if (goal > ZBUFFER_SIZE)
+ goal = ZBUFFER_SIZE;
+ if (write(fd, zbuffer, goal) == goal)
+ remaining -= goal;
+ else
+ success = false;
+ }
+
+ if (!success)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ CloseTransientFile(fd);
+ if (op == DSM_OP_CREATE)
+ unlink(name);
+ errno = save_errno ? save_errno : ENOSPC;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not resize shared memory segment %s to " UINT64_FORMAT " bytes: %m",
+ name, request_size)));
+ return false;
+ }
+ }
+ }
+
+ /*
+ * If we're reattaching or resizing, we must remove any existing mapping,
+ * unless we've already got the right thing mapped.
+ */
+ if (*mapped_address != NULL)
+ {
+ if (*mapped_size == request_size)
+ return true;
+ if (munmap(*mapped_address, *mapped_size) != 0)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ CloseTransientFile(fd);
+ if (op == DSM_OP_CREATE)
+ unlink(name);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not unmap shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = NULL;
+ *mapped_size = 0;
+ }
+
+ /* Map it. */
+ address = mmap(preferred_address, request_size, PROT_READ|PROT_WRITE,
+ MAP_SHARED|MAP_HASSEMAPHORE, fd, 0);
+ if (address == MAP_FAILED)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ CloseTransientFile(fd);
+ if (op == DSM_OP_CREATE)
+ unlink(name);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not map shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = address;
+ *mapped_size = request_size;
+ CloseTransientFile(fd);
+
+ return true;
+}
+#endif
+
+static int
+errcode_for_dynamic_shared_memory(void)
+{
+ if (errno == EFBIG || errno == ENOMEM)
+ return errcode(ERRCODE_OUT_OF_MEMORY);
+ else
+ return errcode_for_file_access();
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a0b741b..040c7aa 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
+#include "storage/dsm.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
@@ -249,6 +250,10 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
ShmemBackendArrayAllocation();
#endif
+ /* Initialize dynamic shared memory facilities. */
+ if (!IsUnderPostmaster)
+ dsm_postmaster_startup();
+
/*
* Now give loadable modules a chance to set up their shmem allocations
*/
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 047dfd1..5dd19c3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -61,6 +61,7 @@
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
+#include "storage/dsm_impl.h"
#include "storage/standby.h"
#include "storage/fd.h"
#include "storage/proc.h"
@@ -385,6 +386,7 @@ static const struct config_enum_entry synchronous_commit_options[] = {
*/
extern const struct config_enum_entry wal_level_options[];
extern const struct config_enum_entry sync_method_options[];
+extern const struct config_enum_entry dynamic_shared_memory_options[];
/*
* GUC option variables that are exported from this module
@@ -3326,6 +3328,16 @@ static struct config_enum ConfigureNamesEnum[] =
},
{
+ {"dynamic_shared_memory_type", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Selects the dynamic shared memory implementation used."),
+ NULL
+ },
+ &dynamic_shared_memory_type,
+ DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE, dynamic_shared_memory_options,
+ NULL, NULL, NULL
+ },
+
+ {
{"wal_sync_method", PGC_SIGHUP, WAL_SETTINGS,
gettext_noop("Selects the method used for forcing WAL updates to disk."),
NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d69a02b..c9cea28 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -123,6 +123,13 @@
#work_mem = 1MB # min 64kB
#maintenance_work_mem = 16MB # min 1MB
#max_stack_depth = 2MB # min 100kB
+#dynamic_shared_memory_type = posix # the default is the first option
+ # supported by the operating system:
+ # posix
+ # sysv
+ # windows
+ # mmap
+ # use none to disable dynamic shared memory
# - Disk -
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index e7ec393..43542cf 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -98,6 +98,11 @@ typedef struct ResourceOwnerData
int nfiles; /* number of owned temporary files */
File *files; /* dynamically allocated array */
int maxfiles; /* currently allocated array size */
+
+ /* We have built-in support for remembering dynamic shmem segments */
+ int ndsms; /* number of owned shmem segments */
+ dsm_segment **dsms; /* dynamically allocated array */
+ int maxdsms; /* currently allocated array size */
} ResourceOwnerData;
@@ -132,6 +137,7 @@ static void PrintPlanCacheLeakWarning(CachedPlan *plan);
static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
+static void PrintDSMLeakWarning(dsm_segment *seg);
/*****************************************************************************
@@ -271,6 +277,21 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
PrintRelCacheLeakWarning(owner->relrefs[owner->nrelrefs - 1]);
RelationClose(owner->relrefs[owner->nrelrefs - 1]);
}
+
+ /*
+ * Release dynamic shared memory segments. Note that dsm_detach()
+ * will remove the segment from my list, so I just have to iterate
+ * until there are none.
+ *
+ * As in the preceding cases, warn if there are leftovers at commit
+ * time.
+ */
+ while (owner->ndsms > 0)
+ {
+ if (isCommit)
+ PrintDSMLeakWarning(owner->dsms[owner->ndsms - 1]);
+ dsm_detach(owner->dsms[owner->ndsms - 1]);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -438,6 +459,8 @@ ResourceOwnerDelete(ResourceOwner owner)
pfree(owner->snapshots);
if (owner->files)
pfree(owner->files);
+ if (owner->dsms)
+ pfree(owner->dsms);
pfree(owner);
}
@@ -1230,3 +1253,88 @@ PrintFileLeakWarning(File file)
"temporary file leak: File %d still referenced",
file);
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * dynamic shmem segment reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeDSMs(ResourceOwner owner)
+{
+ int newmax;
+
+ if (owner->ndsms < owner->maxdsms)
+ return; /* nothing to do */
+
+ if (owner->dsms == NULL)
+ {
+ newmax = 16;
+ owner->dsms = (dsm_segment **)
+ MemoryContextAlloc(TopMemoryContext,
+ newmax * sizeof(dsm_segment *));
+ owner->maxdsms = newmax;
+ }
+ else
+ {
+ newmax = owner->maxdsms * 2;
+ owner->dsms = (dsm_segment **)
+ repalloc(owner->dsms, newmax * sizeof(dsm_segment *));
+ owner->maxdsms = newmax;
+ }
+}
+
+/*
+ * Remember that a dynamic shmem segment is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeDSMs()
+ */
+void
+ResourceOwnerRememberDSM(ResourceOwner owner, dsm_segment *seg)
+{
+ Assert(owner->ndsms < owner->maxdsms);
+ owner->dsms[owner->ndsms] = seg;
+ owner->ndsms++;
+}
+
+/*
+ * Forget that a dynamic shmem segment is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetDSM(ResourceOwner owner, dsm_segment *seg)
+{
+ dsm_segment **dsms = owner->dsms;
+ int ns1 = owner->ndsms - 1;
+ int i;
+
+ for (i = ns1; i >= 0; i--)
+ {
+ if (dsms[i] == seg)
+ {
+ while (i < ns1)
+ {
+ dsms[i] = dsms[i + 1];
+ i++;
+ }
+ owner->ndsms = ns1;
+ return;
+ }
+ }
+ elog(ERROR,
+ "dynamic shared memory segment %lu is not owned by resource owner %s",
+ (unsigned long) dsm_segment_handle(seg), owner->name);
+}
+
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintDSMLeakWarning(dsm_segment *seg)
+{
+ elog(WARNING,
+ "dynamic shared memory leak: segment %lu still referenced",
+ (unsigned long) dsm_segment_handle(seg));
+}
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index f66f530..a6eb0d8 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -182,6 +182,7 @@ const char *subdirs[] = {
"pg_xlog",
"pg_xlog/archive_status",
"pg_clog",
+ "pg_dynshmem",
"pg_notify",
"pg_serial",
"pg_snapshots",
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 033127b..c846e63 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -427,6 +427,9 @@
/* Define to 1 if you have the `setsid' function. */
#undef HAVE_SETSID
+/* Define to 1 if you have the `shm_open' function. */
+#undef HAVE_SHM_OPEN
+
/* Define to 1 if you have the `sigprocmask' function. */
#undef HAVE_SIGPROCMASK
diff --git a/src/include/portability/mem.h b/src/include/portability/mem.h
new file mode 100644
index 0000000..2a07c10
--- /dev/null
+++ b/src/include/portability/mem.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * mem.h
+ * portability definitions for various memory operations
+ *
+ * Copyright (c) 2001-2013, PostgreSQL Global Development Group
+ *
+ * src/include/portability/mem.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef MEM_H
+#define MEM_H
+
+#define IPCProtection (0600) /* access/modify by user only */
+
+#ifdef SHM_SHARE_MMU /* use intimate shared memory on Solaris */
+#define PG_SHMAT_FLAGS SHM_SHARE_MMU
+#else
+#define PG_SHMAT_FLAGS 0
+#endif
+
+/* Linux prefers MAP_ANONYMOUS, but the flag is called MAP_ANON on other systems. */
+#ifndef MAP_ANONYMOUS
+#define MAP_ANONYMOUS MAP_ANON
+#endif
+
+/* BSD-derived systems have MAP_HASSEMAPHORE, but it's not present (or needed) on Linux. */
+#ifndef MAP_HASSEMAPHORE
+#define MAP_HASSEMAPHORE 0
+#endif
+
+#define PG_MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+
+/* Some really old systems don't define MAP_FAILED. */
+#ifndef MAP_FAILED
+#define MAP_FAILED ((void *) -1)
+#endif
+
+#endif /* MEM_H */
diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h
new file mode 100644
index 0000000..4c3923d
--- /dev/null
+++ b/src/include/storage/dsm.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsm.h
+ * manage dynamic shared memory segments
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/dsm.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DSM_H
+#define DSM_H
+
+#include "storage/dsm_impl.h"
+
+typedef struct dsm_segment dsm_segment;
+
+/* Initialization function. */
+extern void dsm_postmaster_startup(void);
+
+/* Functions that create, update, or remove mappings. */
+extern dsm_segment *dsm_create(uint64 size, char *preferred_address);
+extern dsm_segment *dsm_attach(dsm_handle h, char *preferred_address);
+extern void *dsm_resize(dsm_segment *seg, uint64 size,
+ char *preferred_address);
+extern void *dsm_remap(dsm_segment *seg, char *preferred_address);
+extern void dsm_detach(dsm_segment *seg);
+
+/* Resource management functions. */
+extern void dsm_keep_mapping(dsm_segment *seg);
+extern dsm_segment *dsm_find_mapping(dsm_handle h);
+
+/* Informational functions. */
+extern void *dsm_segment_address(dsm_segment *seg);
+extern uint64 dsm_segment_map_length(dsm_segment *seg);
+extern dsm_handle dsm_segment_handle(dsm_segment *seg);
+
+#endif /* DSM_H */
diff --git a/src/include/storage/dsm_impl.h b/src/include/storage/dsm_impl.h
new file mode 100644
index 0000000..177c901
--- /dev/null
+++ b/src/include/storage/dsm_impl.h
@@ -0,0 +1,75 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsm_impl.h
+ * low-level dynamic shared memory primitives
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/dsm_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DSM_IMPL_H
+#define DSM_IMPL_H
+
+/* Dynamic shared memory implementations. */
+#define DSM_IMPL_NONE 0
+#define DSM_IMPL_POSIX 1
+#define DSM_IMPL_SYSV 2
+#define DSM_IMPL_WINDOWS 3
+#define DSM_IMPL_MMAP 4
+
+/*
+ * Determine which dynamic shared memory implementations will be supported
+ * on this platform, and which one will be the default.
+ */
+#ifdef WIN32
+#define USE_DSM_WINDOWS
+#define DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE DSM_IMPL_WINDOWS
+#else
+#ifdef HAVE_SHM_OPEN
+#define USE_DSM_POSIX
+#define DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE DSM_IMPL_POSIX
+#endif
+#define USE_DSM_SYSV
+#ifndef DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE
+#define DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE DSM_IMPL_SYSV
+#endif
+#define USE_DSM_MMAP
+#endif
+
+/* GUC. */
+extern int dynamic_shared_memory_type;
+
+/*
+ * Directory for on-disk state.
+ *
+ * This is used by all implementations for crash recovery and by the mmap
+ * implementation for storage.
+ */
+#define PG_DYNSHMEM_DIR "pg_dynshmem"
+#define PG_DYNSHMEM_MMAP_FILE_PREFIX "mmap."
+
+/* A "name" for a dynamic shared memory segment. */
+typedef uint32 dsm_handle;
+
+/* All the shared-memory operations we know about. */
+typedef enum
+{
+ DSM_OP_CREATE,
+ DSM_OP_ATTACH,
+ DSM_OP_DETACH,
+ DSM_OP_RESIZE,
+ DSM_OP_DESTROY
+} dsm_op;
+
+/* Create, attach to, detach from, resize, or destroy a segment. */
+extern bool dsm_impl_op(dsm_op op, dsm_handle handle, uint64 request_size,
+ void *preferred_address, void **mapped_address,
+ uint64 *mapped_size, int elevel);
+
+/* Some implementations cannot resize segments. Can this one? */
+extern bool dsm_impl_can_resize(void);
+
+#endif /* DSM_IMPL_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 39415a3..730c47b 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -80,6 +80,7 @@ typedef enum LWLockId
OldSerXidLock,
SyncRepLock,
BackgroundWorkerLock,
+ DynamicSharedMemoryControlLock,
/* Individual lock IDs end here */
FirstBufMappingLock,
FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a5d8707..6693483 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -16,6 +16,7 @@
#ifndef RESOWNER_PRIVATE_H
#define RESOWNER_PRIVATE_H
+#include "storage/dsm.h"
#include "storage/fd.h"
#include "storage/lock.h"
#include "utils/catcache.h"
@@ -80,4 +81,11 @@ extern void ResourceOwnerRememberFile(ResourceOwner owner,
extern void ResourceOwnerForgetFile(ResourceOwner owner,
File file);
+/* support for dynamic shared memory management */
+extern void ResourceOwnerEnlargeDSMs(ResourceOwner owner);
+extern void ResourceOwnerRememberDSM(ResourceOwner owner,
+ dsm_segment *);
+extern void ResourceOwnerForgetDSM(ResourceOwner owner,
+ dsm_segment *);
+
#endif /* RESOWNER_PRIVATE_H */
Hi Robert,
[just sending an email which sat in my outbox for two weeks]
On 2013-08-13 21:09:06 -0400, Robert Haas wrote:
...
Nice to see this coming. I think it will actually be interesting for
quite some things outside parallel query, but we'll see.
I've not yet looked at the code, so I just have some high-level comments
so far.
To help solve these problems, I invented something called the "dynamic
shared memory control segment". This is a dynamic shared memory
segment created at startup (or reinitialization) time by the
postmaster before any user processes are created. It is used to store a
list of the identities of all the other dynamic shared memory segments
we have outstanding and the reference count of each. If the
postmaster goes through a crash-and-reset cycle, it scans the control
segment and removes all the other segments mentioned there, and then
recreates the control segment itself. If the postmaster is killed off
(e.g. kill -9) and restarted, it locates the old control segment and
proceeds similarly.
That way any corruption in that area will prevent restarts without
reboot unless you use ipcrm, or such, right?
Creating a shared memory segment is a somewhat operating-system
dependent task. I decided that it would be smart to support several
different implementations and to let the user choose which one they'd
like to use via a new GUC, dynamic_shared_memory_type.
I think we want that during development, but I'd rather not go there
when releasing. After all, we don't support a manual choice between
anonymous mmap/sysv shmem either.
In addition, I've included an implementation based on mmap of a plain
file. As compared with a true shared memory implementation, this
obviously has the disadvantage that the OS may be more likely to
decide to write back dirty pages to disk, which could hurt
performance. However, I believe it's worthy of inclusion all the
same, because there are a variety of situations in which it might be
more convenient than one of the other implementations. One is
debugging.
Hm. Not sure what's the advantage over a corefile here.
On MacOS X, for example, there seems to be no way to list
POSIX shared memory segments, and no easy way to inspect the contents
of either POSIX or System V shared memory segments.
Shouldn't we ourselves know which segments are around?
Another use case
is working around an administrator-imposed or OS-imposed shared memory
limit. If you're not allowed to allocate shared memory, but you are
allowed to create files, then this implementation will let you use
whatever facilities we build on top of dynamic shared memory anyway.
I don't think we should try to work around limits like that.
A third possible reason to use this implementation is
compartmentalization. For example, you can put the directory that
stores the dynamic shared memory segments on a RAM disk - which
removes the performance concern - and then do whatever you like with
that directory: secure it, put filesystem quotas on it, or sprinkle
magic pixie dust on it. It doesn't even seem out of the question that
there might be cases where there are multiple RAM disks present with
different performance characteristics (e.g. on NUMA machines) and this
would provide fine-grained control over where your shared memory
segments get placed. To make a long story short, I won't be crushed
if the consensus is against including this, but I think it's useful.
-1 so far. Seems a bit handwavy to me.
Other implementations are imaginable but not implemented here. For
example, you can imagine using the mmap() of an anonymous file.
However, since the point is that these segments are created on the fly
by individual backends and then shared with other backends, that gets
a little tricky. In order for the second backend to map the same
anonymous shared memory segment that the first one mapped, you'd have
to pass the file descriptor from one process to the other.
It wouldn't even work. Several mappings of /dev/zero et al. do *not*
result in the same virtual memory being mapped. Not even when using the
same (passed around) fd.
Believe me, I tried ;)
There are quite a few problems that this patch does not solve. First,
while it does give you a shared memory segment, it doesn't provide you
with any help at all in figuring out what to put in that segment. The
task of figuring out how to communicate usefully through shared memory
is thus, for the moment, left entirely to the application programmer.
While there may be cases where that's just right, I suspect there will
be a wider range of cases where it isn't, and I plan to work on some
additional facilities, sitting on top of this basic structure, next,
though probably as a separate patch.
Agreed.
Second, it doesn't make any policy decisions about what is sensible either in terms of number of
shared memory segments or the sizes of those segments, even though
there are serious practical limits in both cases. Actually, the total
number of segments system-wide is limited by the size of the control
segment, which is sized based on MaxBackends. But there's nothing to
keep a single backend from eating up all the slots, even though that's
both pretty unfriendly and unportable, and there's no real limit to
the amount of memory it can gobble up per slot, either. In other
words, it would be a bad idea to write a contrib module that exposes a
relatively uncooked version of this layer to the user.
At this point I am rather unconcerned with this point to be
honest.
--- /dev/null
+++ b/src/include/storage/dsm.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsm.h
+ *   manage dynamic shared memory segments
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/dsm.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DSM_H
+#define DSM_H
+
+#include "storage/dsm_impl.h"
+
+typedef struct dsm_segment dsm_segment;
+
+/* Initialization function. */
+extern void dsm_postmaster_startup(void);
+
+/* Functions that create, update, or remove mappings. */
+extern dsm_segment *dsm_create(uint64 size, char *preferred_address);
+extern dsm_segment *dsm_attach(dsm_handle h, char *preferred_address);
+extern void *dsm_resize(dsm_segment *seg, uint64 size,
+        char *preferred_address);
+extern void *dsm_remap(dsm_segment *seg, char *preferred_address);
+extern void dsm_detach(dsm_segment *seg);
Why do we want to expose something unreliable as preferred_address to
the external interface? I haven't read the code yet, so I might be
missing something here.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Aug 27, 2013 at 10:07 AM, Andres Freund <andres@2ndquadrant.com> wrote:
[just sending an email which sat in my outbox for two weeks]
Thanks for taking a look.
Nice to see this coming. I think it will actually be interesting for
quite some things outside parallel query, but we'll see.
Yeah, I hope so. The applications may be somewhat limited by the fact
that there are apparently fairly small limits to how many shared
memory segments you can map at the same time. I believe on one system
I looked at (some version of HP-UX?) the limit was 11. So we won't be
able to go nuts with this: using it definitely introduces all kinds of
failure modes that we don't have today. But it will also let us do
some pretty cool things that we CAN'T do today.
To help solve these problems, I invented something called the "dynamic
shared memory control segment". This is a dynamic shared memory
segment created at startup (or reinitialization) time by the
postmaster before any user process are created. It is used to store a
list of the identities of all the other dynamic shared memory segments
we have outstanding and the reference count of each. If the
postmaster goes through a crash-and-reset cycle, it scans the control
segment and removes all the other segments mentioned there, and then
recreates the control segment itself. If the postmaster is killed off
(e.g. kill -9) and restarted, it locates the old control segment and
proceeds similarly.

That way any corruption in that area will prevent restarts without
reboot unless you use ipcrm, or such, right?
The way I've designed it, no. If what we expect to be the control
segment doesn't exist or doesn't conform to our expectations, we just
assume that it's not really the control segment after all - e.g.
someone rebooted, clearing all the segments, and then an unrelated
process (malicious, perhaps, or just a completely different cluster)
reused the same name. This is similar to what we do for the main
shared memory segment.
Creating a shared memory segment is a somewhat operating-system
dependent task. I decided that it would be smart to support several
different implementations and to let the user choose which one they'd
like to use via a new GUC, dynamic_shared_memory_type.

I think we want that during development, but I'd rather not go there
when releasing. After all, we don't support a manual choice between
anonymous mmap/sysv shmem either.
That's true, but that decision has not been uncontroversial - e.g. the
NetBSD guys don't like it, because they have a big performance
difference between those two types of memory. We have to balance the
possible harm of one more setting against the benefit of letting
people do what they want without needing to recompile or modify code.
In addition, I've included an implementation based on mmap of a plain
file. As compared with a true shared memory implementation, this
obviously has the disadvantage that the OS may be more likely to
decide to write back dirty pages to disk, which could hurt
performance. However, I believe it's worthy of inclusion all the
same, because there are a variety of situations in which it might be
more convenient than one of the other implementations. One is
debugging.

Hm. Not sure what's the advantage over a corefile here.
You can look at it while the server's running.
On MacOS X, for example, there seems to be no way to list
POSIX shared memory segments, and no easy way to inspect the contents
of either POSIX or System V shared memory segments.

Shouldn't we ourselves know which segments are around?
Sure, that's the point of the control segment. But listing a
directory is a lot easier than figuring out what the current control
segment contents are.
Another use case
is working around an administrator-imposed or OS-imposed shared memory
limit. If you're not allowed to allocate shared memory, but you are
allowed to create files, then this implementation will let you use
whatever facilities we build on top of dynamic shared memory anyway.

I don't think we should try to work around limits like that.
I do. There's probably someone, somewhere in the world who thinks
that operating system shared memory limits are a good idea, but I have
not met any such person. There are multiple ways to create shared
memory, and they all have different limits. Normally, System V limits
are small, POSIX limits are large, and the inherited-anonymous-mapping
trick we're now using for the main shared memory segment has no limits
at all. It's very common to run into a system where you can allocate
huge numbers of gigabytes of backend-private memory, but if you try to
allocate 64MB of *shared* memory, you get the axe - or maybe not,
depending on which API you use to create it.
I would never advocate deliberately trying to circumvent a
carefully-considered OS-level policy decision about resource
utilization, but I don't think that's the dynamic here. I think if we
insist on predetermining the dynamic shared memory implementation
based on the OS, we'll just be inconveniencing people needlessly, or
flat-out making things not work. I think this case is roughly similar
to wal_sync_method: there really shouldn't be a performance or
reliability difference between the ~6 ways of flushing a file to disk,
but as it turns out, there is, so we have an option. If we're SURE
that a Linux user will prefer "posix" to "sysv" or "mmap" or "none" in
100% of cases, and that a NetBSD user will always prefer "sysv" over
"mmap" or "none" in 100% of cases, then, OK, sure, let's bake it in.
But I'm not that sure.
It wouldn't even work. Several mappings of /dev/zero et al. do *not*
result in the same virtual memory being mapped. Not even when using the
same (passed around) fd.
Believe me, I tried ;)
OK, well that's another reason I didn't do it that way, then. :-)
At this point I am rather unconcerned with this point to be
honest.
I think that's appropriate; mostly, I wanted to emphasize that the
wisdom of allocating any given amount of shared memory is outside the
scope of this patch, which only aims to provide mechanism, not policy.
Why do we want to expose something unreliable as preferred_address to
the external interface? I haven't read the code yet, so I might be
missing something here.
I shared your opinion that preferred_address is never going to be
reliable, although FWIW Noah thinks it can be made reliable with a
large-enough hammer. But even if it isn't reliable, there doesn't
seem to be all that much value in forbidding access to that part of
the OS-provided API. In the world where it's not reliable, it may
still be convenient to map things at the same address when you can, so
that pointers can be used. Of course you'd have to have some
fallback strategy for when you don't get the same mapping, and maybe
that's painful enough that there's no point after all. Or maybe it's
worth having one code path for relativized pointers and another for
non-relativized pointers.
To be honest, I'm not real sure. I think it's clear enough that this
will meet the minimal requirements for parallel query - ONE dynamic
shared memory segment that's not guaranteed to be at the same address
in every backend, and can't be resized after creation. And we could
pare the API down to only support that. But I'd rather get some
experience with this first before we start taking away options.
Otherwise, we may never really find out the limits of what is possible
in this area, and I think that would be a shame.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 8/13/13 8:09 PM, Robert Haas wrote:
is removed, the segment automatically goes away (we could allow for
server-lifespan segments as well with only trivial changes, but I'm
not sure whether there are compelling use cases for that).
To clarify... you're talking about something that would intentionally survive postmaster restart? I don't see use for that either...
postmaster startup. The other problem, of making sure that segments
get unmapped at the proper time, is solved using the resource owner
mechanism. There is an API to create a mapping which is
session-lifespan rather than resource-owner lifespan, but the default
is resource-owner lifespan, which I suspect will be right for common
uses. Thus, there are four separate occasions on which we remove
shared memory segments: (1) resource owner cleanup, (2) backend exit
(for any session-lifespan mappings and anything else that slips
through the cracks), (3) postmaster exit (in case a child dies without
cleaning itself up), and (4) postmaster startup (in case the
postmaster dies without cleaning up).
Ignorant question... is ResourceOwner related to memory contexts? If not, would memory contexts be a better way to handle memory segment cleanup?
There are quite a few problems that this patch does not solve. First,
It also doesn't provide any mechanism for notifying backends of a new segment. Arguably that's beyond the scope of dsm.c, but ISTM that it'd be useful to have a standard method or three of doing that; perhaps just some convenience functions wrapping the methods mentioned in comments.
Finally, I'd like to thank Noah Misch for a lot of discussion and
thought that enabled me to make this patch much better than it
otherwise would have been. Although I didn't adopt Noah's preferred
solutions to all of the problems, and although there are probably
still some problems buried here, there would have been more if not for
his advice. I'd also like to thank the entire database server team at
EnterpriseDB for allowing me to dump large piles of work on them so
that I could work on this, and my boss, Tom Kincaid, for not allowing
other people to dump large piles of work on me.
Thanks to you and the rest of the folks at EnterpriseDB... dynamic shared memory is something we've needed forever! :)
Other comments...
+ * If the state file is empty or the contents are garbled, it probably means
+ * that the operating system rebooted before the data written by the previous
+ * postmaster made it to disk. In that case, we can just ignore it; any shared
+ * memory from before the reboot should be gone anyway.
I'm a bit concerned about this; I know it was possible in older versions for the global shared memory context to be left behind after a crash and needing to clean it up by hand. Dynamic shared mem potentially multiplies that by 100 or more. I think it'd be worth changing dsm_write_state_file so it always writes a new file and then does an atomic mv (or something similar).
+ * If some other backend exited uncleanly, it might have corrupted the
+ * control segment while it was dying. In that case, we warn and ignore
+ * the contents of the control segment. This may end up leaving behind
+ * stray shared memory segments, but there's not much we can do about
+ * that if the metadata is gone.
Similar concern... in this case, would it be possible to always write updates to an un-used slot and then atomically update a pointer? This would be more work than what I suggested above, so maybe just a TODO for now...
Though... is there anything a dying backend could do that would corrupt the control segment to the point that it would screw up segments allocated by other backends and not related to the dead backend? Like marking a slot as not used when it is still in use and isn't associated with the dead backend? (I'm assuming that if a backend dies unexpectedly then all other backends using memory shared with that backend will need to handle themselves accordingly so that we don't need to worry about that in dsm.c.)
I was able to simplify dsm_create a bit (depending on your definition of simplify...) not sure if the community is OK with using an ereport to exit a loop (that could safely go outside the loop though...). In any case, I traded 5 lines of (mostly) duplicate code with an if{} and a break:
+ nitems = dsm_control->nitems;
+ for (i = 0; i <= nitems; ++i) /* Intentionally go one slot past what's currently been allocated */
+ {
+ if (dsm_control->item[i].refcnt == 0)
+ {
+ dsm_control->item[i].handle = seg->handle;
+ /* refcnt of 1 triggers destruction, so start at 2 */
+ dsm_control->item[i].refcnt = 2;
+ seg->control_slot = i;
+ if (i == nitems) /* We hit the end of the list */
+ {
+ /* Verify that we can support an additional mapping. */
+ if (nitems >= dsm_control->maxitems)
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+ errmsg("too many dynamic shared memory segments")));
+
+ dsm_control->nitems++;
+ }
+ break;
+ }
+ }
+
+ LWLockRelease(DynamicSharedMemoryControlLock);
+ return seg;
Should this (in dsm_attach)
+ * If you're hitting this error, you probably want to use attempt to
be
+ * If you're hitting this error, you probably want to attempt to
?
Should dsm_impl_op sanity check the arguments after op? I didn't notice checks in the type-specific code but I also didn't read all of it... are we just depending on the OS to sanity-check?
Also, does the GUC code enforce that the GUC must always be something that's supported? If not then the error in dsm_impl_op should be more user-friendly.
I basically stopped reading after dsm_impl_op... the rest of the stuff was rather over my head.
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
Hi,
On 2013-08-28 15:20:57 -0400, Robert Haas wrote:
That way any corruption in that area will prevent restarts without
reboot unless you use ipcrm, or such, right?

The way I've designed it, no. If what we expect to be the control
segment doesn't exist or doesn't conform to our expectations, we just
assume that it's not really the control segment after all - e.g.
someone rebooted, clearing all the segments, and then an unrelated
process (malicious, perhaps, or just a completely different cluster)
reused the same name. This is similar to what we do for the main
shared memory segment.
The case I am mostly wondering about is some process crashing and
overwriting random memory. We need to be pretty sure that we'll never
fail partially through cleaning up old segments because they are
corrupted or because we died halfway through our last cleanup attempt.
I think we want that during development, but I'd rather not go there
when releasing. After all, we don't support a manual choice between
anonymous mmap/sysv shmem either.
That's true, but that decision has not been uncontroversial - e.g. the
NetBSD guys don't like it, because they have a big performance
difference between those two types of memory. We have to balance the
possible harm of one more setting against the benefit of letting
people do what they want without needing to recompile or modify code.
But then, it made them fix the issue afaik :P
In addition, I've included an implementation based on mmap of a plain
file. As compared with a true shared memory implementation, this
obviously has the disadvantage that the OS may be more likely to
decide to write back dirty pages to disk, which could hurt
performance. However, I believe it's worthy of inclusion all the
same, because there are a variety of situations in which it might be
more convenient than one of the other implementations. One is
debugging.

Hm. Not sure what's the advantage over a corefile here.
You can look at it while the server's running.
That's what debuggers are for.
On MacOS X, for example, there seems to be no way to list
POSIX shared memory segments, and no easy way to inspect the contents
of either POSIX or System V shared memory segments.
Shouldn't we ourselves know which segments are around?
Sure, that's the point of the control segment. But listing a
directory is a lot easier than figuring out what the current control
segment contents are.
But without a good amount of tooling - like in a debugger... - it's not
very interesting to look at those files either way? The mere presence of
a segment doesn't tell you much and the contents won't be easily
readable.
Another use case is working around an administrator-imposed or
OS-imposed shared memory limit. If you're not allowed to allocate
shared memory, but you are allowed to create files, then this
implementation will let you use whatever facilities we build on top
of dynamic shared memory anyway.

I don't think we should try to work around limits like that.
I do. There's probably someone, somewhere in the world who thinks
that operating system shared memory limits are a good idea, but I have
not met any such person.
"Let's drive users away from sysv shem" is the only one I heard so far ;)
I would never advocate deliberately trying to circumvent a
carefully-considered OS-level policy decision about resource
utilization, but I don't think that's the dynamic here. I think if we
insist on predetermining the dynamic shared memory implementation
based on the OS, we'll just be inconveniencing people needlessly, or
flat-out making things not work. [...]
But using file-backed memory will *suck* performancewise. Why should we
ever want to offer that to a user? That's what I was arguing about
primarily.
If we're SURE
that a Linux user will prefer "posix" to "sysv" or "mmap" or "none" in
100% of cases, and that a NetBSD user will always prefer "sysv" over
"mmap" or "none" in 100% of cases, then, OK, sure, let's bake it in.
But I'm not that sure.
I think posix shmem will be preferred to sysv shmem if present, in just
about any relevant case. I don't know of any system with lower limits on
posix shmem than on sysv.
I think this case is roughly similar
to wal_sync_method: there really shouldn't be a performance or
reliability difference between the ~6 ways of flushing a file to disk,
but as it turns out, there is, so we have an option.
Well, most of them actually give different guarantees, so it makes sense
to have differing performance...
Why do we want to expose something unreliable as preferred_address to
the external interface? I haven't read the code yet, so I might be
missing something here.
I shared your opinion that preferred_address is never going to be
reliable, although FWIW Noah thinks it can be made reliable with a
large-enough hammer.
I think we need to have the arguments for that on list then. Those are
pretty damn fundamental design decisions.
I for one cannot see how you even remotely could make that work a) on
windows (check the troubles we have to go through to get s_b
consistently placed, and that's directly after startup) b) 32bit systems.
But even if it isn't reliable, there doesn't seem to be all that much
value in forbidding access to that part of the OS-provided API. In
the world where it's not reliable, it may still be convenient to map
things at the same address when you can, so that pointers can't be
used. Of course you'd have to have some fallback strategy for when
you don't get the same mapping, and maybe that's painful enough that
there's no point after all. Or maybe it's worth having one code path
for relativized pointers and another for non-relativized pointers.
It seems likely to me that will end up with untested code in that
case. Or even unsupported platforms.
To be honest, I'm not real sure. I think it's clear enough that this
will meet the minimal requirements for parallel query - ONE dynamic
shared memory segment that's not guaranteed to be at the same address
in every backend, and can't be resized after creation. And we could
pare the API down to only support that. But I'd rather get some
experience with this first before we start taking away options.
Otherwise, we may never really find out the limits of what is possible
in this area, and I think that would be a shame.
On the other hand, adding capabilities annoys people far less than
deciding that we can't support them in the end and taking them away.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 30, 2013 at 9:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Hi,
On 2013-08-28 15:20:57 -0400, Robert Haas wrote:
Why do we want to expose something unreliable as preferred_address to
to have differing performance...Why do we want to expose something unreliable as preferred_address to
the external interface? I haven't read the code yet, so I might be
missing something here.

I shared your opinion that preferred_address is never going to be
reliable, although FWIW Noah thinks it can be made reliable with a
large-enough hammer.

I think we need to have the arguments for that on list then. Those are
pretty damn fundamental design decisions.
I for one cannot see how you even remotely could make that work a) on
windows (check the troubles we have to go through to get s_b
consistently placed, and that's directly after startup) b) 32bit systems.
For Windows, I believe we are already doing something similar
(attaching at a predefined address) for the main shared memory
segment. It reserves memory at a particular address using
pgwin32_ReserveSharedMemoryRegion() before actually
starting (resuming a process created in suspended mode) a process, and
then, after starting, the backend attaches at the same
address (PGSharedMemoryReAttach).
I think one question here is what the use of exposing
preferred_address is, to which I can think of only the below:
a. The base OS APIs provide such a provision, so why don't we?
b. While browsing, I found a few examples on the IBM site where they
also show usage with a preferred address.
http://publib.boulder.ibm.com/infocenter/comphelp/v7v91/index.jsp?topic=%2Fcom.ibm.vacpp7a.doc%2Fproguide%2Fref%2Fcreate_heap.htm
c. If a user wishes to attach segments at the same base address, so
that it can access pointers in the memory-mapped file, which
otherwise would not be possible.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Aug 29, 2013 at 8:12 PM, Jim Nasby <jim@nasby.net> wrote:
On 8/13/13 8:09 PM, Robert Haas wrote:
is removed, the segment automatically goes away (we could allow for
server-lifespan segments as well with only trivial changes, but I'm
not sure whether there are compelling use cases for that).

To clarify... you're talking about something that would intentionally survive
postmaster restart? I don't see use for that either...
No, I meant something that would live as long as the postmaster and
die when it dies.
Ignorant question... is ResourceOwner related to memory contexts? If not,
would memory contexts be a better way to handle memory segment cleanup?
Nope. :-)
There are quite a few problems that this patch does not solve. First,
It also doesn't provide any mechanism for notifying backends of a new
segment. Arguably that's beyond the scope of dsm.c, but ISTM that it'd be
useful to have a standard method or three of doing that; perhaps just some
convenience functions wrapping the methods mentioned in comments.
I don't see that as being generally useful. Backends need to know
more than "there's a new segment", and in fact most backends won't
care about most new segments. A background worker needs to know about
the new segment *that it should attach*, but we have bgw_main_arg. If
we end up using this facility for any system-wide purposes, I imagine
we'll do that by storing the segment ID in the main shared memory
segment someplace.
Thanks to you and the rest of the folks at EnterpriseDB... dynamic shared
memory is something we've needed forever! :)
Thanks.
Other comments...
+ * If the state file is empty or the contents are garbled, it probably means
+ * that the operating system rebooted before the data written by the previous
+ * postmaster made it to disk. In that case, we can just ignore it; any shared
+ * memory from before the reboot should be gone anyway.

I'm a bit concerned about this; I know it was possible in older versions for
the global shared memory context to be left behind after a crash, needing
to be cleaned up by hand. Dynamic shared mem potentially multiplies that by
100 or more. I think it'd be worth changing dsm_write_state_file so it
always writes a new file and then does an atomic mv (or something similar).
I agree that the possibilities for leftover shared memory segments are
multiplied with this new facility, and I've done my best to address
that. However, I don't agree that writing the state file in a
different way would improve anything.
+ * If some other backend exited uncleanly, it might have corrupted the
+ * control segment while it was dying. In that case, we warn and ignore
+ * the contents of the control segment. This may end up leaving behind
+ * stray shared memory segments, but there's not much we can do about
+ * that if the metadata is gone.

Similar concern... in this case, would it be possible to always write
updates to an un-used slot and then atomically update a pointer? This would
be more work than what I suggested above, so maybe just a TODO for now...

Though... is there anything a dying backend could do that would corrupt the
control segment to the point that it would screw up segments allocated by
other backends and not related to the dead backend? Like marking a slot as
not used when it is still in use and isn't associated with the dead backend?
Sure. A messed-up backend can clobber the control segment just as it
can clobber anything else in shared memory. There's really no way
around that problem. If the control segment has been overwritten by a
memory stomp, we can't use it to clean up. There's no way around that
problem except to not have a control segment at all, which wouldn't be better.
Should this (in dsm_attach)
+ * If you're hitting this error, you probably want to use attempt to
be
+ * If you're hitting this error, you probably want to attempt to
?
Good point.
Should dsm_impl_op sanity check the arguments after op? I didn't notice
checks in the type-specific code but I also didn't read all of it... are we
just depending on the OS to sanity-check?
Sanity-check for what?
Also, does the GUC code enforce that the GUC must always be something that's
supported? If not then the error in dsm_impl_op should be more
user-friendly.
Yes.
I basically stopped reading after dsm_impl_op... the rest of the stuff was
rather over my head.
:-)
Thanks for your interest!
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Aug 30, 2013 at 11:45 AM, Andres Freund <andres@2ndquadrant.com> wrote:
The way I've designed it, no. If what we expect to be the control
segment doesn't exist or doesn't conform to our expectations, we just
assume that it's not really the control segment after all - e.g.
someone rebooted, clearing all the segments, and then an unrelated
process (malicious, perhaps, or just a completely different cluster)
reused the same name. This is similar to what we do for the main
shared memory segment.

The case I am mostly wondering about is some process crashing and
overwriting random memory. We need to be pretty sure that we'll never
fail partially through cleaning up old segments because they are
corrupted or because we died halfway through our last cleanup attempt.
Right. I had those considerations in mind and I believe I have nailed
the hatch shut pretty tight. The cleanup code is designed never to
die with an error. Of course it might, but it would have to be
something like an out of memory failure or similar that isn't really
what we're concerned about here. You are welcome to look for holes,
but these issues are where most of my brainpower went during
development.
That's true, but that decision has not been uncontroversial - e.g. the
NetBSD guys don't like it, because they have a big performance
difference between those two types of memory. We have to balance the
possible harm of one more setting against the benefit of letting
people do what they want without needing to recompile or modify code.

But then, it made them fix the issue afaik :P
Pah. :-)
You can look at it while the server's running.
That's what debuggers are for.
Tough crowd. I like it. YMMV.
I would never advocate deliberately trying to circumvent a
carefully-considered OS-level policy decision about resource
utilization, but I don't think that's the dynamic here. I think if we
insist on predetermining the dynamic shared memory implementation
based on the OS, we'll just be inconveniencing people needlessly, or
flat-out making things not work. [...]

But using file-backed memory will *suck* performancewise. Why should we
ever want to offer that to a user? That's what I was arguing about
primarily.
I see. There might be additional writeback traffic, but it might not
be that bad in common cases. After all, the data's pretty hot.
If we're SURE
that a Linux user will prefer "posix" to "sysv" or "mmap" or "none" in
100% of cases, and that a NetBSD user will always prefer "sysv" over
"mmap" or "none" in 100% of cases, then, OK, sure, let's bake it in.
But I'm not that sure.

I think posix shmem will be preferred to sysv shmem if present, in just
about any relevant case. I don't know of any system with lower limits on
posix shmem than on sysv.
OK, how about this.... SysV doesn't allow extending segments, but
mmap does. The thing here is that you're saying "remove mmap and keep
sysv" but Noah suggested to me that we remove sysv and keep mmap.
This suggests to me that the picture is not so black and white as you
think it is.
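The resizing asymmetry mentioned here can be illustrated with a short sketch; the segment name and sizes are made up, and it's worth noting that some platforms restrict resizing a POSIX object after creation:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Illustrative sketch: a POSIX shared memory object can be grown after
 * creation with ftruncate(), whereas a System V segment's size is fixed
 * at shmget() time. The name is made up; note that some platforms
 * (macOS, for one) only let you set the size of a POSIX object once. */
static int
create_and_grow(const char *name, off_t initial, off_t larger)
{
    struct stat st;
    int fd;

    shm_unlink(name);           /* tolerate leftovers from earlier runs */
    fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    if (ftruncate(fd, initial) != 0 ||  /* set the initial size */
        ftruncate(fd, larger) != 0 ||   /* ...and extend it later */
        fstat(fd, &st) != 0)
    {
        close(fd);
        shm_unlink(name);
        return -1;
    }
    close(fd);
    shm_unlink(name);
    return (st.st_size == larger) ? 0 : -1;
}
```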
I shared your opinion that preferred_address is never going to be
reliable, although FWIW Noah thinks it can be made reliable with a
large-enough hammer.

I think we need to have the arguments for that on list then. Those are
pretty damn fundamental design decisions.
I for one cannot see how you even remotely could make that work a) on
windows (check the troubles we have to go through to get s_b
consistently placed, and that's directly after startup) b) 32bit systems.
Noah?
But even if it isn't reliable, there doesn't seem to be all that much
value in forbidding access to that part of the OS-provided API. In
the world where it's not reliable, it may still be convenient to map
things at the same address when you can, so that pointers can be
used. Of course you'd have to have some fallback strategy for when
you don't get the same mapping, and maybe that's painful enough that
there's no point after all. Or maybe it's worth having one code path
for relativized pointers and another for non-relativized pointers.

It seems likely to me that we'll end up with untested code in that
case. Or even unsupported platforms.
Maybe. I think for the amount of code we're talking about here, it's
not worth getting excited about.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Aug 31, 2013 at 08:27:14AM -0400, Robert Haas wrote:
On Fri, Aug 30, 2013 at 11:45 AM, Andres Freund <andres@2ndquadrant.com> wrote:
I shared your opinion that preferred_address is never going to be
reliable, although FWIW Noah thinks it can be made reliable with a
large-enough hammer.

I think we need to have the arguments for that on list then. Those are
pretty damn fundamental design decisions.
I somewhat disfavor having a vague "preferred_address" parameter. mmap()'s
first argument is specified that way, but mmap()'s specification caters to an
open-ended range of implementations and clients. A PostgreSQL backend
interface can be more rigid. If we choose to support fixed-address callers,
let those receive either the requested address or an ereport(ERROR). If the
caller does not care, make no effort to provide a consistent address. (Better
still, under --enable-cassert, try to force the address to differ across
processes.)
[quotations reordered]
But even if it isn't reliable, there doesn't seem to be all that much
value in forbidding access to that part of the OS-provided API. [...]
That's also valid, though. Even if no core code exploits the flexibility,
3rd-party code might do so.
In the world where it's not reliable, it may still be convenient to map
things at the same address when you can, so that pointers can be
used. Of course you'd have to have some fallback strategy for when
you don't get the same mapping, and maybe that's painful enough that
there's no point after all. Or maybe it's worth having one code path
for relativized pointers and another for non-relativized pointers.

It seems likely to me that we'll end up with untested code in that
case. Or even unsupported platforms.
I agree. It would take an exceptional use case to justify such parallel code
paths; I won't expect that to ever happen for core code.
I for one cannot see how you even remotely could make that work a) on
windows (check the troubles we have to go through to get s_b
consistently placed, and that's directly after startup) b) 32bit systems.Noah?
The difficulty depends on whether processes other than the segment's creator
will attach anytime or only as they start. Attachment at startup is enough
for parallel query, but it's not enough for something like lock table
expansion. I'll focus on the attach-anytime case since it's more general.
On a system supporting MAP_FIXED, implement this by having the postmaster
reserve address space under a PROT_NONE mapping, then carving out from that
mapping for each fixed-address dynamic segment. The size of the reservation
would be controlled by a GUC; one might set it to several times anticipated
peak usage. (The overhead of doing that depends on the kernel.) Windows
permits the same technique with its own primitives.
A system where mmap() accepts only a zero address in practice (HP-UX,
according to Gnulib, although HP docs suggest it has improved over time)
requires a different technique. For those systems, expand the regular shared
memory segment and carve from that to make "dynamic" segments. This amounts
to adding ShmemFree() to supplement ShmemAlloc(). If a core platform had to
use this implementation, its disadvantages would be sufficient to discard the
whole idea of reliable fixed addresses. But I find it acceptable if it's a
crutch for older kernels, rare hardware, etc.
I don't foresee fundamental differences on 32-bit. All the allocation
maximums scale down, but that's the usual story for 32-bit.
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
Hi Noah,
On 2013-09-01 09:24:00 -0400, Noah Misch wrote:
But even if it isn't reliable, there doesn't seem to be all that much
value in forbidding access to that part of the OS-provided API. [...]

That's also valid, though. Even if no core code exploits the flexibility,
3rd-party code might do so.
It seems more likely that 3rd party code misunderstands the
limitations. But perhaps that's being too picky.
I for one cannot see how you even remotely could make that work a) on
windows (check the troubles we have to go through to get s_b
consistently placed, and that's directly after startup) b) 32bit systems.

Noah?
The difficulty depends on whether processes other than the segment's creator
will attach anytime or only as they start. Attachment at startup is enough
for parallel query, but it's not enough for something like lock table
expansion. I'll focus on the attach-anytime case since it's more general.
Even on startup it might get more complicated than one immediately
imagines on EXEC_BACKEND type platforms because their memory layout
doesn't need to be the same. The more shared memory you need, the harder
that will be. Afair
On a system supporting MAP_FIXED, implement this by having the postmaster
reserve address space under a PROT_NONE mapping, then carving out from that
mapping for each fixed-address dynamic segment. The size of the reservation
would be controlled by a GUC; one might set it to several times anticipated
peak usage. (The overhead of doing that depends on the kernel.) Windows
permits the same technique with its own primitives.
Note that allocating a large mapping, even without using it, has
noticeable cost, at least under linux. The kernel has to create & copy
data to track each pages state (without copying the memory content's
itself due to COW) for every fork afterwards. If you don't believe me,
check the whole discussion about go's (the language) memory
management...
If that's the solution we go for why don't we just always include heaps
more shared memory and use that (remapping part of it as PROT_NONE)
instead of building the infrastructure in this patch?
A system where mmap() accepts only a zero address in practice (HP-UX,
according to Gnulib, although HP docs suggest it has improved over time)
requires a different technique. For those systems, expand the regular shared
memory segment and carve from that to make "dynamic" segments. This amounts
to adding ShmemFree() to supplement ShmemAlloc(). If a core platform had to
use this implementation, its disadvantages would be sufficient to discard the
whole idea of reliable fixed addresses. But I find it acceptable if it's a
crutch for older kernels, rare hardware, etc.

I don't foresee fundamental differences on 32-bit. All the allocation
maximums scale down, but that's the usual story for 32-bit.
If you actually want to allocate memory after starting up, without
carving a section out for that from the beginning, the memory
fragmentation will make it very hard to find memory addresses that are
the same across processes.
If you go for allocating from the start what you end up will not
actually be "dynamic shared memory" because you cannot allocate much
(even if not actually backed by memory!) without compromising
performance. So we will end up with a configurable
"dynamic_shared_memory = ..." parameter. And even then, all processes,
even those not actually using the "dynamic" memory, will have to pay the
price of not being able to allocate as much memory as otherwise possible.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Sep 01, 2013 at 05:08:38PM +0200, Andres Freund wrote:
On 2013-09-01 09:24:00 -0400, Noah Misch wrote:
The difficulty depends on whether processes other than the segment's creator
will attach anytime or only as they start. Attachment at startup is enough
for parallel query, but it's not enough for something like lock table
expansion. I'll focus on the attach-anytime case since it's more general.

Even on startup it might get more complicated than one immediately
imagines on EXEC_BACKEND type platforms because their memory layout
doesn't need to be the same. The more shared memory you need, the harder
that will be. Afair
Non-Windows EXEC_BACKEND is already facing a dead end that way.
On a system supporting MAP_FIXED, implement this by having the postmaster
reserve address space under a PROT_NONE mapping, then carving out from that
mapping for each fixed-address dynamic segment. The size of the reservation
would be controlled by a GUC; one might set it to several times anticipated
peak usage. (The overhead of doing that depends on the kernel.) Windows
permits the same technique with its own primitives.

Note that allocating a large mapping, even without using it, has
noticeable cost, at least under linux. The kernel has to create & copy
data to track each pages state (without copying the memory content's
itself due to COW) for every fork afterwards. If you don't believe me,
check the whole discussion about go's (the language) memory
management...
I believe you, but I'd appreciate a link to the discussion you have in mind.
If that's the solution we go for why don't we just always include heaps
more shared memory and use that (remapping part of it as PROT_NONE)
instead of building the infrastructure in this patch?
There would be no freeing of the memory; a usage high water mark would stand
for the life of the postmaster.
I don't foresee fundamental differences on 32-bit. All the allocation
maximums scale down, but that's the usual story for 32-bit.

If you actually want to allocate memory after starting up, without
carving a section out for that from the beginning, the memory
fragmentation will make it very hard to find memory addresses that are
the same across processes.
True. I wouldn't feel bad if total dynamic shared memory usage above, say,
256 MiB were unreliable on 32-bit. If you're still running 32-bit in 2015,
you probably have a low-memory platform.
I think the take-away is that we have a lot of knobs available, not a bright
line between possible and impossible. Robert opted to omit provision for
reliable fixed addresses, and the upsides of that decision are the absence of
a DBA-unfriendly space-reservation GUC, trivial overhead when the APIs are not
used, and a clearer portability outlook.
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
Hi Noah!
On 2013-09-01 12:07:04 -0400, Noah Misch wrote:
On Sun, Sep 01, 2013 at 05:08:38PM +0200, Andres Freund wrote:
On 2013-09-01 09:24:00 -0400, Noah Misch wrote:
The difficulty depends on whether processes other than the segment's creator
will attach anytime or only as they start. Attachment at startup is enough
for parallel query, but it's not enough for something like lock table
expansion. I'll focus on the attach-anytime case since it's more general.Even on startup it might get more complicated than one immediately
imagines on EXEC_BACKEND type platforms because their memory layout
doesn't need to be the same. The more shared memory you need, the harder
that will be. Afair

Non-Windows EXEC_BACKEND is already facing a dead end that way.
Not sure whether you mean non-windows EXEC_BACKEND isn't going to be
supported for much longer or that it already has problems.
On a system supporting MAP_FIXED, implement this by having the postmaster
reserve address space under a PROT_NONE mapping, then carving out from that
mapping for each fixed-address dynamic segment. The size of the reservation
would be controlled by a GUC; one might set it to several times anticipated
peak usage. (The overhead of doing that depends on the kernel.) Windows
permits the same technique with its own primitives.Note that allocating a large mapping, even without using it, has
noticeable cost, at least under linux. The kernel has to create & copy
data to track each pages state (without copying the memory content's
itself due to COW) for every fork afterwards. If you don't believe me,
check the whole discussion about go's (the language) memory
management...

I believe you, but I'd appreciate a link to the discussion you have in mind.
Unfortunately I could only find the first half of the discussion about
the issue. Turns out it's not the greatest idea to name your fancy new
programming language "go" (yesyes, petpeeve of mine).
http://lkml.org/lkml/2011/2/8/118
https://lwn.net/Articles/428100/
So, after reading up on the issue a bit more and reading some more
kernel code, a large mmap(PROT_NONE, MAP_PRIVATE) won't cause many
problems except counting in ulimit -v. It will *not* cause overcommit
violations. mmap(PROT_NONE, MAP_SHARED) will tho, even if not yet
faulted. Which means that to be reliable and not violate overcommit we'd
need to munmap() a chunk of PROT_NONE, MAP_PRIVATE memory, and
immediately (without interceding mallocs, using mmap itself) map it again.
It only gets really expensive in the sense of making fork expensive if
you set protections on many regions in that mapping individually. Each
mprotect() call will split the VMA into distinct pieces and they won't
get merged even if there are neighbors with the same settings.
I don't foresee fundamental differences on 32-bit. All the allocation
maximums scale down, but that's the usual story for 32-bit.

If you actually want to allocate memory after starting up, without
carving a section out for that from the beginning, the memory
fragmentation will make it very hard to find memory addresses that are
the same across processes.

True. I wouldn't feel bad if total dynamic shared memory usage above, say,
256 MiB were unreliable on 32-bit. If you're still running 32-bit in 2015,
you probably have a low-memory platform.
Not sure. I think that will partially depend on whether x32 will have
any success which I still find hard to judge.
I think the take-away is that we have a lot of knobs available, not a bright
line between possible and impossible. Robert opted to omit provision for
reliable fixed addresses, and the upsides of that decision are the absence of
a DBA-unfriendly space-reservation GUC, trivial overhead when the APIs are not
used, and a clearer portability outlook.
I guess my point is that if we want to develop stuff that requires
reliable addresses, we should build support for that from a low level
up. Not rely on a hack^Wlayer on top of the actual dynamic shared memory
API.
That is, it should be a flag to dsm_create() that we require a fixed
address and dsm_attach() will then automatically use that or die
trying. Requiring implementations to take care about passing addresses
around and fiddling with mmap/windows api to make sure those mappings
are possible doesn't strike me as a good idea.
In the end, you're going to be the primary/first user as far as I
understand things, so you'll have to argue whether we need fixed
addresses or not. I don't think it's a good idea to forgo this decision
on this layer and bolt another on top if we decide it's necessary.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Sep 2, 2013 at 6:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Not sure whether you mean non-windows EXEC_BACKEND isn't going to be
supported for much longer or that it already has problems.
I'm not sure what Noah was getting at, but I have used EXEC_BACKEND
twice now during development, in situations where I would have needed
a Windows development environment otherwise. So it's definitely useful, at least
to me. But on my MacBook Pro, you have to compile it with -fno-pie (I
think that's the right flag) to disable ASLR in order to get reliable
operation. I imagine such problems will become commonplace on more
and more platforms as time wears on.
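For what it's worth, a non-Windows EXEC_BACKEND build along those lines might be invoked as below; the exact flag spellings vary by compiler version and are an assumption here, not something taken from the thread:

```shell
# Hypothetical build: define EXEC_BACKEND and disable PIE so that ASLR
# doesn't move the executable between processes. Flag names differ
# across gcc/clang versions; adjust to taste.
./configure CPPFLAGS="-DEXEC_BACKEND" CFLAGS="-O2 -fno-pie"
make
```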
I guess my point is that if we want to develop stuff that requires
reliable addresses, we should build support for that from a low level
up. Not rely on a hack^Wlayer on top of the actual dynamic shared memory
API.
That is, it should be a flag to dsm_create() that we require a fixed
address and dsm_attach() will then automatically use that or die
trying. Requiring implementations to take care about passing addresses
around and fiddling with mmap/windows api to make sure those mappings
are possible doesn't strike me as a good idea.

In the end, you're going to be the primary/first user as far as I
understand things, so you'll have to argue whether we need fixed
addresses or not. I don't think it's a good idea to forgo this decision
on this layer and bolt another on top if we decide it's necessary.
I didn't intend to punt that decision to another layer so much as
another patch and a more detailed examination of requirements. IME,
given a choice between something that is 99% reliable and provides
more functionality, or something that is 99.99% reliable and provides
less functionality, this community picks the latter every time. And
that's why I've left out any capability to insist on a fixed address
from this patch. It would be nice to have, to be sure. But it also
would take more work and add more complexity, and I don't have a clear
sense that that work would be justified.
Now, we might get to a point where it seems clear that we're not going
to get any further with parallelism without adding a capability for
fixed-address mappings. If that happens, I think that's the time to
come back to this layer and add that capability. But right now it
doesn't seem essential. Now, having said that, I didn't see any
particular reason to bury the ability to pass mmap() or shmat() a
*preferred* address. But IJWH.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Sep 03, 2013 at 12:52:22AM +0200, Andres Freund wrote:
On 2013-09-01 12:07:04 -0400, Noah Misch wrote:
On Sun, Sep 01, 2013 at 05:08:38PM +0200, Andres Freund wrote:
On 2013-09-01 09:24:00 -0400, Noah Misch wrote:
The difficulty depends on whether processes other than the segment's creator
will attach anytime or only as they start. Attachment at startup is enough
for parallel query, but it's not enough for something like lock table
expansion. I'll focus on the attach-anytime case since it's more general.

Even on startup it might get more complicated than one immediately
imagines on EXEC_BACKEND type platforms because their memory layout
doesn't need to be the same. The more shared memory you need, the harder
that will be. Afair

Non-Windows EXEC_BACKEND is already facing a dead end that way.
Not sure whether you mean non-windows EXEC_BACKEND isn't going to be
supported for much longer or that it already has problems.
It already has problems: ASLR measures sometimes prevent reattachment of the
main shared memory segment. Multiplying the combined size of our
fixed-address mappings does not push us over some threshold where this becomes
a problem, because it is already a problem.
Note that allocating a large mapping, even without using it, has
noticeable cost, at least under linux. The kernel has to create & copy
data to track each pages state (without copying the memory content's
itself due to COW) for every fork afterwards.
So, after reading up on the issue a bit more and reading some more
kernel code, a large mmap(PROT_NONE, MAP_PRIVATE) won't cause many
problems except counting in ulimit -v. It will *not* cause overcommit
violations. mmap(PROT_NONE, MAP_SHARED) will tho, even if not yet
faulted. Which means that to be reliable and not violate overcommit we'd
need to munmap() a chunk of PROT_NONE, MAP_PRIVATE memory, and
immediately (without interceding mallocs, using mmap itself) map it again.

It only gets really expensive in the sense of making fork expensive if
you set protections on many regions in that mapping individually. Each
mprotect() call will split the VMA into distinct pieces and they won't
get merged even if there are neighbors with the same settings.
Thanks for researching that.
I don't foresee fundamental differences on 32-bit. All the allocation
maximums scale down, but that's the usual story for 32-bit.

If you actually want to allocate memory after starting up, without
carving a section out for that from the beginning, the memory
fragmentation will make it very hard to find memory addresses that are
the same across processes.

True. I wouldn't feel bad if total dynamic shared memory usage above, say,
256 MiB were unreliable on 32-bit. If you're still running 32-bit in 2015,
you probably have a low-memory platform.

Not sure. I think that will partially depend on whether x32 will have
any success which I still find hard to judge.
I won't hold my breath for x32 becoming a common platform for high-memory
database servers, regardless of other successes it might find. Not
impossible, but I recommend placing trivial priority on maximizing performance
for that scenario.
I think the take-away is that we have a lot of knobs available, not a bright
line between possible and impossible. Robert opted to omit provision for
reliable fixed addresses, and the upsides of that decision are the absence of
a DBA-unfriendly space-reservation GUC, trivial overhead when the APIs are not
used, and a clearer portability outlook.

I guess my point is that if we want to develop stuff that requires
reliable addresses, we should build support for that from a low level
up. Not rely on a hack^Wlayer on top of the actual dynamic shared memory
API.
That is, it should be a flag to dsm_create() that we require a fixed
address and dsm_attach() will then automatically use that or die
trying. Requiring implementations to take care about passing addresses
around and fiddling with mmap/windows api to make sure those mappings
are possible doesn't strike me as a good idea.
I agree.
In the end, you're going to be the primary/first user as far as I
understand things, so you'll have to argue whether we need fixed
addresses or not. I don't think it's a good idea to forgo this decision
on this layer and bolt another on top if we decide it's necessary.
We don't need fixed addresses. Parallel internal sort will probably include
the equivalent of a SortTuple array in its shared memory segment, and that
implies relative pointers to the tuples also stored in shared memory. I
expect that wart to be fairly isolated within the code, so little harm done.
I don't think we will have at all painted ourselves into a corner, should we
wish to lift the limitation later.
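The relative-pointer scheme Noah sketches could look roughly like this; the typedef, struct, and function names are hypothetical, not from any patch:

```c
#include <stddef.h>

/* Sketch of the relative-pointer idea: store byte offsets from the
 * segment base instead of raw pointers, so the same stored value is
 * valid in every process regardless of where the segment got mapped. */
typedef size_t relptr;          /* byte offset from the segment base */

static relptr
relptr_make(char *base, void *ptr)
{
    return (relptr) ((char *) ptr - base);
}

static void *
relptr_resolve(char *base, relptr off)
{
    return base + off;
}

/* A parallel-sort entry might then carry an offset to the tuple body
 * stored elsewhere in the same segment, rather than a raw pointer. */
typedef struct
{
    relptr  tuple;              /* where the tuple lives in the segment */
    int     srctape;
} SharedSortTupleSketch;

static int
relptr_demo(void)
{
    char    seg[128];           /* stand-in for a mapped segment */
    void   *p = seg + 40;

    return relptr_resolve(seg, relptr_make(seg, p)) == p ? 0 : -1;
}
```

The cost is an extra addition on every dereference, which is why mapping at a common address (and using raw pointers) remains attractive when it can be had.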
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
On 8/31/13 7:17 AM, Robert Haas wrote:
On Thu, Aug 29, 2013 at 8:12 PM, Jim Nasby <jim@nasby.net> wrote:
On 8/13/13 8:09 PM, Robert Haas wrote:
is removed, the segment automatically goes away (we could allow for
server-lifespan segments as well with only trivial changes, but I'm
not sure whether there are compelling use cases for that).

To clarify... you're talking something that would intentionally survive
postmaster restart? I don't see use for that either...

No, I meant something that would live as long as the postmaster and
die when it dies.
ISTM that at some point we'll want to look at putting top-level shared memory into this system (ie: allowing dynamic resizing of GUCs that affect shared memory size).
But as you said, it'd be trivial to add that later.
Other comments...
+ * If the state file is empty or the contents are garbled, it probably means
+ * that the operating system rebooted before the data written by the previous
+ * postmaster made it to disk. In that case, we can just ignore it; any shared
+ * memory from before the reboot should be gone anyway.

I'm a bit concerned about this; I know it was possible in older versions for
the global shared memory context to be left behind after a crash and needing
to clean it up by hand. Dynamic shared mem potentially multiplies that by
100 or more. I think it'd be worth changing dsm_write_state_file so it
always writes a new file and then does an atomic mv (or something similar).

I agree that the possibilities for leftover shared memory segments are
multiplied with this new facility, and I've done my best to address
that. However, I don't agree that writing the state file in a
different way would improve anything.
Wouldn't it protect against a crash while writing the file? I realize the odds of that are pretty remote, but AFAIK it wouldn't cost that much to write a new file and do an atomic mv...
+ * If some other backend exited uncleanly, it might have corrupted the
+ * control segment while it was dying. In that case, we warn and ignore
+ * the contents of the control segment. This may end up leaving behind
+ * stray shared memory segments, but there's not much we can do about
+ * that if the metadata is gone.

Similar concern... in this case, would it be possible to always write
updates to an un-used slot and then atomically update a pointer? This would
be more work than what I suggested above, so maybe just a TODO for now...

Though... is there anything a dying backend could do that would corrupt the
control segment to the point that it would screw up segments allocated by
other backends and not related to the dead backend? Like marking a slot as
not used when it is still in use and isn't associated with the dead backend?

Sure. A messed-up backend can clobber the control segment just as it
can clobber anything else in shared memory. There's really no way
around that problem. If the control segment has been overwritten by a
memory stomp, we can't use it to clean up. There's no way around that
problem except to not have the control segment, which wouldn't be better.
Are we trying to protect against "memory stomps" when we restart after a backend dies? I thought we were just trying to ensure that all shared data structures were correct and consistent. If that's the case, then I was thinking that by using a pointer that can be updated in a CPU-atomic fashion we know we'd never end up with a corrupted entry that was in use; the partial write would be to a slot with nothing pointing at it so it could be safely reused.
Like I said before though, it may not be worth worrying about this case right now.
Should dsm_impl_op sanity check the arguments after op? I didn't notice
checks in the type-specific code but I also didn't read all of it... are we
just depending on the OS to sanity-check?

Sanity-check for what?
Presumably there's limits to what the arguments can be rationally set to. IIRC there's nothing down-stream that's checking them in our code, so I'm guessing we're just depending on the kernel to sanity-check.
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Sep 4, 2013 at 6:38 PM, Jim Nasby <jim@nasby.net> wrote:
No, I meant something that would live as long as the postmaster and
die when it dies.

ISTM that at some point we'll want to look at putting top-level shared
memory into this system (ie: allowing dynamic resizing of GUCs that affect
shared memory size).
A lot of people want that, but being able to resize the shared memory
chunk itself is only the beginning of the problem. So I wouldn't hold
my breath.
Wouldn't it protect against a crash while writing the file? I realize the
odds of that are pretty remote, but AFAIK it wouldn't cost that much to
write a new file and do an atomic mv...
If there's an OS-level crash, we don't need the state file; the shared
memory will be gone anyway. And if it's a PostgreSQL-level failure,
this game neither helps nor hurts.
Sure. A messed-up backend can clobber the control segment just as it
can clobber anything else in shared memory. There's really no way
around that problem. If the control segment has been overwritten by a
memory stomp, we can't use it to clean up. There's no way around that
problem except to not use the control segment, which wouldn't be better.

Are we trying to protect against "memory stomps" when we restart after a
backend dies? I thought we were just trying to ensure that all shared data
structures were correct and consistent. If that's the case, then I was
thinking that by using a pointer that can be updated in a CPU-atomic fashion
we know we'd never end up with a corrupted entry that was in use; the
partial write would be to a slot with nothing pointing at it so it could be
safely reused.
When we restart after a backend dies, shared memory contents are
completely reset, from scratch. This is true of both the fixed size
shared memory segment and of the dynamic shared memory control
segment. The only difference is that, with the dynamic shared memory
control segment, we need to use the segment for cleanup before
throwing it out and starting over. Extra caution is required because
we're examining memory that could hypothetically have been stomped on;
we must not let the postmaster do anything suicidal.
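The "extra caution" amounts to validating the old control segment before believing anything it says; the patch does this in dsm_control_segment_sane(). A simplified sketch of that kind of check, with the field layout following the patch's dsm_control_header:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CONTROL_MAGIC 0x9a503d32u	/* PG_DYNSHMEM_CONTROL_MAGIC */

typedef struct
{
	uint32_t	magic;
	uint32_t	nitems;
	uint32_t	maxitems;
} control_header;

/* Refuse to trust a reattached control segment unless its header is
 * internally consistent; a stomped segment must not send the
 * postmaster off destroying arbitrary shared memory handles. */
static bool
control_segment_sane(const control_header *hdr, uint64_t mapped_size)
{
	if (mapped_size < sizeof(control_header))
		return false;				/* too small to contain a header */
	if (hdr->magic != CONTROL_MAGIC)
		return false;				/* never initialized, or stomped */
	if (hdr->nitems > hdr->maxitems)
		return false;				/* internally inconsistent counts */
	return true;
}
```

If any check fails, cleanup just detaches and starts over, leaving at worst some stray segments behind.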
Should dsm_impl_op sanity check the arguments after op? I didn't notice
checks in the type-specific code but I also didn't read all of it... are
we
just depending on the OS to sanity-check?

Sanity-check for what?
Presumably there's limits to what the arguments can be rationally set to.
IIRC there's nothing down-stream that's checking them in our code, so I'm
guessing we're just depending on the kernel to sanity-check.
Pretty much. It's possible more thought is needed there, but the
shape of those additional thoughts is not clear to me at this time.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 9/5/13 11:37 AM, Robert Haas wrote:
ISTM that at some point we'll want to look at putting top-level shared
memory into this system (ie: allowing dynamic resizing of GUCs that affect
shared memory size).

A lot of people want that, but being able to resize the shared memory
chunk itself is only the beginning of the problem. So I wouldn't hold
my breath.
<starts breathing again>
Wouldn't it protect against a crash while writing the file? I realize the
odds of that are pretty remote, but AFAIK it wouldn't cost that much to
write a new file and do an atomic mv...

If there's an OS-level crash, we don't need the state file; the shared
memory will be gone anyway. And if it's a PostgreSQL-level failure,
this game neither helps nor hurts.

Sure. A messed-up backend can clobber the control segment just as it
can clobber anything else in shared memory. There's really no way
around that problem. If the control segment has been overwritten by a
memory stomp, we can't use it to clean up. There's no way around that
problem except to not use the control segment, which wouldn't be better.

Are we trying to protect against "memory stomps" when we restart after a
backend dies? I thought we were just trying to ensure that all shared data
structures were correct and consistent. If that's the case, then I was
thinking that by using a pointer that can be updated in a CPU-atomic fashion
we know we'd never end up with a corrupted entry that was in use; the
partial write would be to a slot with nothing pointing at it so it could be
safely reused.

When we restart after a backend dies, shared memory contents are
completely reset, from scratch. This is true of both the fixed size
shared memory segment and of the dynamic shared memory control
segment. The only difference is that, with the dynamic shared memory
control segment, we need to use the segment for cleanup before
throwing it out and starting over. Extra caution is required because
we're examining memory that could hypothetically have been stomped on;
we must not let the postmaster do anything suicidal.
Not doing something suicidal is what I'm worried about (that and not cleaning up as well as possible).
The specific scenario I'm worried about is something like a PANIC in the middle of the snprintf call in dsm_write_state_file(). That would leave that file in a completely unknown state so who knows what would then happen on restart. ISTM that writing a temp file and then doing a filesystem mv would eliminate that issue.
Or is it safe to assume that the snprintf call will be atomic since we're just spitting out a long?
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On Fri, Sep 6, 2013 at 3:40 PM, Jim Nasby <jim@nasby.net> wrote:
The specific scenario I'm worried about is something like a PANIC in the
middle of the snprintf call in dsm_write_state_file(). That would leave that
file in a completely unknown state so who knows what would then happen on
restart. ISTM that writing a temp file and then doing a filesystem mv would
eliminate that issue.
Doing an atomic rename would eliminate the possibility of seeing a
partially written file, but a partially written file is mostly
harmless: we'll interpret whatever bytes we see as an integer and try
to use that as a DSM key. Then we'll just see that no such shared
memory key exists (probably) or that we don't own it (probably) or
that it doesn't look like a valid control segment (probably) and
ignore it.
If someone does a kill -9 the postmaster in the middle of write()
creating a partially written file, and the partially written file
happens to identify another shared memory segment owned by the same
user ID with the correct magic number and header contents to be
interpreted as a control segment, then we will indeed erroneously blow
away that purported control segment and all other segments to which it
points. I suppose we can stick in a rename() there just to completely
rule out that scenario, but it's pretty bloody unlikely anyway.
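The write-then-rename approach agreed to here can be sketched as follows. The directory layout and helper name are illustrative, not the patch's actual code; the essential property is that rename() atomically replaces the target on POSIX filesystems, so a crash mid-write leaves either the old state file or none, never a torn one:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write the control segment handle to a scratch file, then rename() it
 * into place.  Readers can therefore never observe a partially written
 * state file, only the previous complete one or the new complete one. */
static int
write_state_file(const char *dir, unsigned long handle)
{
	char		tmppath[512];
	char		path[512];
	char		buf[64];
	int			fd;
	int			len;

	snprintf(tmppath, sizeof(tmppath), "%s/state.new", dir);
	snprintf(path, sizeof(path), "%s/state", dir);
	len = snprintf(buf, sizeof(buf), "%lu\n", handle);

	fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
	{
		close(fd);
		unlink(tmppath);
		return -1;
	}
	close(fd);
	return rename(tmppath, path);	/* atomic replacement */
}
```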
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
OK, here's v2 of this patch, by myself and Amit Kapila. We made the
following changes:
- Added support for Windows. This necessitated adding an impl_private
parameter to dsm_impl_op.
- Since we had impl_private anyway, I used it to implement shm
identifier caching for the System V implementation. I like how that
turned out better than the previous version; YMMV.
- Fixed typo noted by Jim Nasby.
- Removed preferred_address parameter, per griping from Andres and Noah.
- Removed use of posix_fallocate, per recent commits.
- Added use of rename() so that we won't ever see a partially-written
state file, per griping by Jim Nasby.
- Added an overflow check so that if a user of a 32-bit system asks
for 4.1GB of dynamic shared memory, they get an error instead of
getting .1GB of memory.
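The overflow check mentioned above guards against a 64-bit request size being silently truncated when cast to a 32-bit size_t. Roughly, with an illustrative helper (size_max is a parameter only so the 32-bit case can be exercised on any host):

```c
#include <stdbool.h>
#include <stdint.h>

/* A 4.1GB request cast to a 32-bit size_t would silently wrap to about
 * 0.1GB; detect the truncation before the cast ever happens. */
static bool
request_fits(uint64_t request, uint64_t size_max)
{
	return request <= size_max;
}
```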
Despite Andres's comments, I did not remove the mmap implementation or
the GUC that allows users to select which implementation they care to
use. I still think those things are useful. While I appreciate that
there's a marginal cost in complexity to each new GUC, I also don't
think it pays to get too cheap. There's a difference between not
requiring users to configure things that they shouldn't have to
configure, and not letting them configure things they might want to
configure. Besides, having a GUC also provides a way of turning the
feature completely off, which seems justified, at least for now, on
the basis of the (ahem) limited number of users of this facility.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
dynshmem-v2.patch (application/octet-stream)
diff --git a/configure b/configure
index c685ca3..97d2f68 100755
--- a/configure
+++ b/configure
@@ -8384,6 +8384,180 @@ if test "$ac_res" != no; then
fi
+{ $as_echo "$as_me:$LINENO: checking for library containing shm_open" >&5
+$as_echo_n "checking for library containing shm_open... " >&6; }
+if test "${ac_cv_search_shm_open+set}" = set; then
+ $as_echo_n "(cached) " >&6
+else
+ ac_func_search_save_LIBS=$LIBS
+cat >conftest.$ac_ext <<_ACEOF
+/* confdefs.h. */
+_ACEOF
+cat confdefs.h >>conftest.$ac_ext
+cat >>conftest.$ac_ext <<_ACEOF
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char shm_open ();
+int
+main ()
+{
+return shm_open ();
+ ;
+ return 0;
+}
+_ACEOF
+for ac_lib in '' rt; do
+ if test -z "$ac_lib"; then
+ ac_res="none required"
+ else
+ ac_res=-l$ac_lib
+ LIBS="-l$ac_lib $ac_func_search_save_LIBS"
+ fi
+ rm -f conftest.$ac_objext conftest$ac_exeext
+if { (ac_try="$ac_link"
+case "(($ac_try" in
+ *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
+ *) ac_try_echo=$ac_try;;
+esac
+eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\""
+$as_echo "$ac_try_echo") >&5
+ (eval "$ac_link") 2>conftest.er1
+ ac_status=$?
+ grep -v '^ *+' conftest.er1 >conftest.err
+ rm -f conftest.er1
+ cat conftest.err >&5
+ $as_echo "$as_me:$LINENO: \$? = $ac_status" >&5
+ (exit $ac_status); } && {
+ test -z "$ac_c_werror_flag" ||
+ test ! -s conftest.err
+ } && test -s conftest$ac_exeext && {
+ test "$cross_compiling" = yes ||
+ $as_test_x conftest$ac_exeext
+ }; then
+ ac_cv_search_shm_open=$ac_res
+else
+ $as_echo "$as_me: failed program was:" >&5
+sed 's/^/| /' conftest.$ac_ext >&5
+
+
+fi
+
+rm -rf conftest.dSYM
+rm -f core conftest.err conftest.$ac_objext conftest_ipa8_conftest.oo \
+ conftest$ac_exeext
+ if test "${ac_cv_search_shm_open+set}" = set; then
+ break
+fi
+done
+if test "${ac_cv_search_shm_open+set}" = set; then
+ :
+else
+ ac_cv_search_shm_open=no
+fi
+rm conftest.$ac_ext
+LIBS=$ac_func_search_save_LIBS
+fi
+{ $as_echo "$as_me:$LINENO: result: $ac_cv_search_shm_open" >&5
+$as_echo "$ac_cv_search_shm_open" >&6; }
+ac_res=$ac_cv_search_shm_open
+if test "$ac_res" != no; then
+ test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
+
+fi
+
+{ $as_echo "$as_me:$LINENO: checking for library containing shm_unlink" >&5
+$as_echo_n "checking for library containing shm_unlink... " >&6; }
+if test "${ac_cv_search_shm_unlink+set}" = set; then
+ $as_echo_n "(cached) " >&6
+else
+ ac_func_search_save_LIBS=$LIBS
+cat >conftest.$ac_ext <<_ACEOF
+/* confdefs.h. */
+_ACEOF
+cat confdefs.h >>conftest.$ac_ext
+cat >>conftest.$ac_ext <<_ACEOF
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char shm_unlink ();
+int
+main ()
+{
+return shm_unlink ();
+ ;
+ return 0;
+}
+_ACEOF
+for ac_lib in '' rt; do
+ if test -z "$ac_lib"; then
+ ac_res="none required"
+ else
+ ac_res=-l$ac_lib
+ LIBS="-l$ac_lib $ac_func_search_save_LIBS"
+ fi
+ rm -f conftest.$ac_objext conftest$ac_exeext
+if { (ac_try="$ac_link"
+case "(($ac_try" in
+ *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
+ *) ac_try_echo=$ac_try;;
+esac
+eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\""
+$as_echo "$ac_try_echo") >&5
+ (eval "$ac_link") 2>conftest.er1
+ ac_status=$?
+ grep -v '^ *+' conftest.er1 >conftest.err
+ rm -f conftest.er1
+ cat conftest.err >&5
+ $as_echo "$as_me:$LINENO: \$? = $ac_status" >&5
+ (exit $ac_status); } && {
+ test -z "$ac_c_werror_flag" ||
+ test ! -s conftest.err
+ } && test -s conftest$ac_exeext && {
+ test "$cross_compiling" = yes ||
+ $as_test_x conftest$ac_exeext
+ }; then
+ ac_cv_search_shm_unlink=$ac_res
+else
+ $as_echo "$as_me: failed program was:" >&5
+sed 's/^/| /' conftest.$ac_ext >&5
+
+
+fi
+
+rm -rf conftest.dSYM
+rm -f core conftest.err conftest.$ac_objext conftest_ipa8_conftest.oo \
+ conftest$ac_exeext
+ if test "${ac_cv_search_shm_unlink+set}" = set; then
+ break
+fi
+done
+if test "${ac_cv_search_shm_unlink+set}" = set; then
+ :
+else
+ ac_cv_search_shm_unlink=no
+fi
+rm conftest.$ac_ext
+LIBS=$ac_func_search_save_LIBS
+fi
+{ $as_echo "$as_me:$LINENO: result: $ac_cv_search_shm_unlink" >&5
+$as_echo "$ac_cv_search_shm_unlink" >&6; }
+ac_res=$ac_cv_search_shm_unlink
+if test "$ac_res" != no; then
+ test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
+
+fi
+
# Solaris:
{ $as_echo "$as_me:$LINENO: checking for library containing fdatasync" >&5
$as_echo_n "checking for library containing fdatasync... " >&6; }
@@ -19763,7 +19937,8 @@ LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
+
+for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
do
as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
{ $as_echo "$as_me:$LINENO: checking for $ac_func" >&5
diff --git a/configure.in b/configure.in
index 82771bd..ead0908 100644
--- a/configure.in
+++ b/configure.in
@@ -883,6 +883,8 @@ case $host_os in
esac
AC_SEARCH_LIBS(getopt_long, [getopt gnugetopt])
AC_SEARCH_LIBS(crypt, crypt)
+AC_SEARCH_LIBS(shm_open, rt)
+AC_SEARCH_LIBS(shm_unlink, rt)
# Solaris:
AC_SEARCH_LIBS(fdatasync, [rt posix4])
# Required for thread_test.c on Solaris 2.5:
@@ -1230,7 +1232,7 @@ PGAC_FUNC_GETTIMEOFDAY_1ARG
LIBS_including_readline="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
+AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
AC_REPLACE_FUNCS(fseeko)
case $host_os in
diff --git a/contrib/dsm_demo/Makefile b/contrib/dsm_demo/Makefile
new file mode 100644
index 0000000..dd9ea92
--- /dev/null
+++ b/contrib/dsm_demo/Makefile
@@ -0,0 +1,17 @@
+# contrib/dsm_demo/Makefile
+
+MODULES = dsm_demo
+
+EXTENSION = dsm_demo
+DATA = dsm_demo--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/dsm_demo
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/dsm_demo/dsm_demo--1.0.sql b/contrib/dsm_demo/dsm_demo--1.0.sql
new file mode 100644
index 0000000..7ad6ab1
--- /dev/null
+++ b/contrib/dsm_demo/dsm_demo--1.0.sql
@@ -0,0 +1,14 @@
+/* contrib/dsm_demo/dsm_demo--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION dsm_demo" to load this file. \quit
+
+CREATE FUNCTION dsm_demo_create(pg_catalog.text)
+RETURNS pg_catalog.int8 STRICT
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION dsm_demo_read(pg_catalog.int8)
+RETURNS pg_catalog.text STRICT
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
diff --git a/contrib/dsm_demo/dsm_demo.c b/contrib/dsm_demo/dsm_demo.c
new file mode 100644
index 0000000..0ebbd68
--- /dev/null
+++ b/contrib/dsm_demo/dsm_demo.c
@@ -0,0 +1,97 @@
+/* -------------------------------------------------------------------------
+ *
+ * dsm_demo.c
+ * Dynamic shared memory demonstration.
+ *
+ * Copyright (C) 2013, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/dsm_demo/dsm_demo.c
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/dsm.h"
+#include "fmgr.h"
+
+PG_MODULE_MAGIC;
+
+void _PG_init(void);
+Datum dsm_demo_create(PG_FUNCTION_ARGS);
+Datum dsm_demo_read(PG_FUNCTION_ARGS);
+
+PG_FUNCTION_INFO_V1(dsm_demo_create);
+PG_FUNCTION_INFO_V1(dsm_demo_read);
+
+#define DSM_DEMO_MAGIC 0x44454D4F
+
+typedef struct
+{
+ uint32 magic;
+ int32 len;
+ char data[FLEXIBLE_ARRAY_MEMBER];
+} dsm_demo_payload;
+
+Datum
+dsm_demo_create(PG_FUNCTION_ARGS)
+{
+ text *txt = PG_GETARG_TEXT_PP(0);
+ int len = VARSIZE_ANY(txt);
+ uint64 seglen;
+ dsm_segment *seg;
+ dsm_demo_payload *payload;
+
+ seglen = offsetof(dsm_demo_payload, data) + len;
+ seg = dsm_create(seglen);
+ dsm_keep_mapping(seg);
+
+ payload = dsm_segment_address(seg);
+ payload->magic = DSM_DEMO_MAGIC;
+ payload->len = len;
+ memcpy(payload->data, txt, len);
+
+ PG_RETURN_INT64(dsm_segment_handle(seg));
+}
+
+Datum
+dsm_demo_read(PG_FUNCTION_ARGS)
+{
+ dsm_handle h = PG_GETARG_INT64(0);
+ dsm_segment *seg;
+ bool needs_detach = false;
+ text *txt = NULL;
+ dsm_demo_payload *payload;
+
+ /*
+ * We could be called from the same session that called dsm_demo_create(),
+ * so search for an existing mapping. If we don't find one, attach the
+ * segment.
+ */
+ seg = dsm_find_mapping(h);
+ if (seg == NULL)
+ {
+ seg = dsm_attach(h);
+ if (!seg)
+ PG_RETURN_NULL();
+ needs_detach = true;
+ }
+
+ /* Extract data, after checking magic number. */
+ payload = dsm_segment_address(seg);
+ if (payload->magic == DSM_DEMO_MAGIC)
+ {
+ txt = palloc(payload->len);
+ memcpy(txt, payload->data, payload->len);
+ }
+
+ /* Detach, if there was no existing mapping. */
+ if (needs_detach)
+ dsm_detach(seg);
+
+ if (txt == NULL)
+ PG_RETURN_NULL();
+
+ PG_RETURN_TEXT_P(txt);
+}
diff --git a/contrib/dsm_demo/dsm_demo.control b/contrib/dsm_demo/dsm_demo.control
new file mode 100644
index 0000000..4060791
--- /dev/null
+++ b/contrib/dsm_demo/dsm_demo.control
@@ -0,0 +1,5 @@
+# dsm_demo extension
+comment = 'Dynamic shared memory demonstration'
+default_version = '1.0'
+module_pathname = 'dsm_demo'
+relocatable = true
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 20e3c32..b604407 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -29,6 +29,7 @@
#endif
#include "miscadmin.h"
+#include "portability/mem.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
@@ -36,31 +37,6 @@
typedef key_t IpcMemoryKey; /* shared memory key passed to shmget(2) */
typedef int IpcMemoryId; /* shared memory ID returned by shmget(2) */
-#define IPCProtection (0600) /* access/modify by user only */
-
-#ifdef SHM_SHARE_MMU /* use intimate shared memory on Solaris */
-#define PG_SHMAT_FLAGS SHM_SHARE_MMU
-#else
-#define PG_SHMAT_FLAGS 0
-#endif
-
-/* Linux prefers MAP_ANONYMOUS, but the flag is called MAP_ANON on other systems. */
-#ifndef MAP_ANONYMOUS
-#define MAP_ANONYMOUS MAP_ANON
-#endif
-
-/* BSD-derived systems have MAP_HASSEMAPHORE, but it's not present (or needed) on Linux. */
-#ifndef MAP_HASSEMAPHORE
-#define MAP_HASSEMAPHORE 0
-#endif
-
-#define PG_MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
-
-/* Some really old systems don't define MAP_FAILED. */
-#ifndef MAP_FAILED
-#define MAP_FAILED ((void *) -1)
-#endif
-
unsigned long UsedShmemSegID = 0;
void *UsedShmemSegAddr = NULL;
diff --git a/src/backend/storage/ipc/Makefile b/src/backend/storage/ipc/Makefile
index 743f30e..873dd60 100644
--- a/src/backend/storage/ipc/Makefile
+++ b/src/backend/storage/ipc/Makefile
@@ -15,7 +15,7 @@ override CFLAGS+= -fno-inline
endif
endif
-OBJS = ipc.o ipci.o pmsignal.o procarray.o procsignal.o shmem.o shmqueue.o \
- sinval.o sinvaladt.o standby.o
+OBJS = dsm_impl.o dsm.o ipc.o ipci.o pmsignal.o procarray.o procsignal.o \
+ shmem.o shmqueue.o sinval.o sinvaladt.o standby.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
new file mode 100644
index 0000000..60c239f
--- /dev/null
+++ b/src/backend/storage/ipc/dsm.c
@@ -0,0 +1,975 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsm.c
+ * manage dynamic shared memory segments
+ *
+ * This file provides a set of services to make programming with dynamic
+ * shared memory segments more convenient. Unlike the low-level
+ * facilities provided by dsm_impl.h and dsm_impl.c, mappings and segments
+ * created using this module will be cleaned up automatically. Mappings
+ * will be removed when the resource owner under which they were created
+ * is cleaned up, unless dsm_keep_mapping() is used, in which case they
+ * have session lifespan. Segments will be removed when there are no
+ * remaining mappings, or at postmaster shutdown in any case. After a
+ * hard postmaster crash, remaining segments will be removed, if they
+ * still exist, at the next postmaster startup.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/ipc/dsm.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <fcntl.h>
+#include <string.h>
+#include <unistd.h>
+#ifndef WIN32
+#include <sys/mman.h>
+#endif
+#include <sys/stat.h>
+
+#include "lib/ilist.h"
+#include "miscadmin.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/resowner_private.h"
+
+#define PG_DYNSHMEM_STATE_FILE PG_DYNSHMEM_DIR "/state"
+#define PG_DYNSHMEM_NEW_STATE_FILE PG_DYNSHMEM_DIR "/state.new"
+#define PG_DYNSHMEM_STATE_BUFSIZ 512
+#define PG_DYNSHMEM_CONTROL_MAGIC 0x9a503d32
+
+/*
+ * There's no point in getting too cheap here, because the minimum allocation
+ * is one OS page, which is probably at least 4KB and could easily be as high
+ * as 64KB. Each slot costs sizeof(dsm_control_item), currently 8 bytes.
+ */
+#define PG_DYNSHMEM_FIXED_SLOTS 64
+#define PG_DYNSHMEM_SLOTS_PER_BACKEND 2
+
+#define INVALID_CONTROL_SLOT ((uint32) -1)
+
+struct dsm_segment
+{
+ dlist_node node; /* List link in dsm_segment_list. */
+ ResourceOwner resowner; /* Resource owner. */
+ dsm_handle handle; /* Segment name. */
+ uint32 control_slot; /* Slot in control segment. */
+ void *impl_private; /* Implementation-specific private data. */
+ void *mapped_address; /* Mapping address, or NULL if unmapped. */
+ uint64 mapped_size; /* Size of our mapping. */
+};
+
+typedef struct dsm_control_item
+{
+ dsm_handle handle;
+ uint32 refcnt; /* 2+ = active, 1 = moribund, 0 = gone */
+} dsm_control_item;
+
+typedef struct dsm_control_header
+{
+ uint32 magic;
+ uint32 nitems;
+ uint32 maxitems;
+ dsm_control_item item[FLEXIBLE_ARRAY_MEMBER];
+} dsm_control_header;
+
+static void dsm_cleanup_using_control_segment(void);
+static void dsm_cleanup_for_mmap(void);
+static bool dsm_read_state_file(dsm_handle *h);
+static void dsm_write_state_file(dsm_handle h);
+static void dsm_postmaster_shutdown(int code, Datum arg);
+static void dsm_backend_shutdown(int code, Datum arg);
+static dsm_segment *dsm_create_descriptor(void);
+static bool dsm_control_segment_sane(dsm_control_header *control,
+ uint64 mapped_size);
+static uint64 dsm_control_bytes_needed(uint32 nitems);
+
+/* Has this backend initialized the dynamic shared memory system yet? */
+static bool dsm_init_done = false;
+
+/*
+ * List of dynamic shared memory segments used by this backend.
+ *
+ * At process exit time, we must decrement the reference count of each
+ * segment we have attached; this list makes it possible to find all such
+ * segments.
+ *
+ * This list should always be empty in the postmaster. We could probably
+ * allow the postmaster to map dynamic shared memory segments before it
+ * begins to start child processes, provided that each process adjusted
+ * the reference counts for those segments in the control segment at
+ * startup time, but there's no obvious need for such a facility, which
+ * would also be complex to handle in the EXEC_BACKEND case. Once the
+ * postmaster has begun spawning children, there's an additional problem:
+ * each new mapping would require an update to the control segment,
+ * which requires locking, in which the postmaster must not be involved.
+ */
+static dlist_head dsm_segment_list = DLIST_STATIC_INIT(dsm_segment_list);
+
+/*
+ * Control segment information.
+ *
+ * Unlike ordinary shared memory segments, the control segment is not
+ * reference counted; instead, it lasts for the postmaster's entire
+ * life cycle. For simplicity, it doesn't have a dsm_segment object either.
+ */
+static dsm_handle dsm_control_handle;
+static dsm_control_header *dsm_control;
+static uint64 dsm_control_mapped_size = 0;
+static void *dsm_control_impl_private = NULL;
+
+/*
+ * Start up the dynamic shared memory system.
+ *
+ * This is called just once during each cluster lifetime, at postmaster
+ * startup time.
+ */
+void
+dsm_postmaster_startup(void)
+{
+ void *dsm_control_address = NULL;
+ uint32 maxitems;
+ uint64 segsize;
+
+ Assert(!IsUnderPostmaster);
+
+ /* If dynamic shared memory is disabled, there's nothing to do. */
+ if (dynamic_shared_memory_type == DSM_IMPL_NONE)
+ return;
+
+ /*
+ * Check for, and remove, shared memory segments left behind by a dead
+ * postmaster. This isn't necessary on Windows, which always removes them
+ * when the last reference is gone.
+ */
+ switch (dynamic_shared_memory_type)
+ {
+ case DSM_IMPL_POSIX:
+ case DSM_IMPL_SYSV:
+ dsm_cleanup_using_control_segment();
+ break;
+ case DSM_IMPL_MMAP:
+ dsm_cleanup_for_mmap();
+ break;
+ case DSM_IMPL_WINDOWS:
+ /* Nothing to do. */
+ break;
+ default:
+ elog(ERROR, "unknown dynamic shared memory type: %d",
+ dynamic_shared_memory_type);
+ }
+
+ /* Determine size for new control segment. */
+ maxitems = PG_DYNSHMEM_FIXED_SLOTS
+ + PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;
+ elog(DEBUG2, "dynamic shared memory system will support %lu segments",
+ (unsigned long) maxitems);
+ segsize = dsm_control_bytes_needed(maxitems);
+
+ /* Create new control segment. */
+ for (;;)
+ {
+ Assert(dsm_control_address == NULL);
+ Assert(dsm_control_mapped_size == 0);
+ dsm_control_handle = random();
+ if (dsm_impl_op(DSM_OP_CREATE, dsm_control_handle, segsize,
+ &dsm_control_impl_private, &dsm_control_address,
+ &dsm_control_mapped_size, ERROR))
+ break;
+ }
+ dsm_control = dsm_control_address;
+ on_shmem_exit(dsm_postmaster_shutdown, 0);
+ elog(DEBUG2, "created dynamic shared memory control segment %lu ("
+ UINT64_FORMAT " bytes)", (unsigned long) dsm_control_handle,
+ segsize);
+ dsm_write_state_file(dsm_control_handle);
+
+ /* Initialize control segment. */
+ dsm_control->magic = PG_DYNSHMEM_CONTROL_MAGIC;
+ dsm_control->nitems = 0;
+ dsm_control->maxitems = maxitems;
+}
+
+/*
+ * Determine whether the control segment from the previous postmaster
+ * invocation still exists. If so, remove the dynamic shared memory
+ * segments to which it refers, and then the control segment itself.
+ */
+static void
+dsm_cleanup_using_control_segment(void)
+{
+ void *mapped_address = NULL;
+ void *junk_mapped_address = NULL;
+ void *impl_private = NULL;
+ void *junk_impl_private = NULL;
+ uint64 mapped_size = 0;
+ uint64 junk_mapped_size = 0;
+ uint32 nitems;
+ uint32 i;
+ dsm_handle old_control_handle;
+ dsm_control_header *old_control;
+
+ /*
+ * Read the state file. If it doesn't exist or is empty, there's nothing
+ * more to do.
+ */
+ if (!dsm_read_state_file(&old_control_handle))
+ return;
+
+ /*
+ * Try to attach the segment. If this fails, it probably just means that
+ * the operating system has been rebooted and the segment no longer exists,
+ * or an unrelated process has used the same shm ID. So just fall out
+ * quietly.
+ */
+ if (!dsm_impl_op(DSM_OP_ATTACH, old_control_handle, 0, &impl_private,
+ &mapped_address, &mapped_size, DEBUG1))
+ return;
+
+ /*
+ * We've managed to reattach it, but the contents might not be sane.
+ * If they aren't, we disregard the segment after all.
+ */
+ old_control = (dsm_control_header *) mapped_address;
+ if (!dsm_control_segment_sane(old_control, mapped_size))
+ {
+ dsm_impl_op(DSM_OP_DETACH, old_control_handle, 0, &impl_private,
+ &mapped_address, &mapped_size, LOG);
+ return;
+ }
+
+ /*
+ * OK, the control segment looks basically valid, so we can use
+ * it to get a list of segments that need to be removed.
+ */
+ nitems = old_control->nitems;
+ for (i = 0; i < nitems; ++i)
+ {
+ dsm_handle handle;
+
+ /* If the reference count is 0, the slot is actually unused. */
+ if (old_control->item[i].refcnt == 0)
+ continue;
+
+ /* Log debugging information. */
+ handle = old_control->item[i].handle;
+ elog(DEBUG2, "cleaning up orphaned dynamic shared memory with ID %lu",
+ (unsigned long) handle);
+
+ /* Destroy the referenced segment. */
+ dsm_impl_op(DSM_OP_DESTROY, handle, 0, &junk_impl_private,
+ &junk_mapped_address, &junk_mapped_size, LOG);
+ }
+
+ /* Destroy the old control segment, too. */
+ elog(DEBUG2,
+ "cleaning up dynamic shared memory control segment with ID %lu",
+ (unsigned long) old_control_handle);
+ dsm_impl_op(DSM_OP_DESTROY, old_control_handle, 0, &impl_private,
+ &mapped_address, &mapped_size, LOG);
+}
+
+/*
+ * When we're using the mmap shared memory implementation, "shared memory"
+ * segments might even manage to survive an operating system reboot.
+ * But there's no guarantee as to exactly what will survive: some segments
+ * may survive, and others may not, and the contents of some may be out
+ * of date. In particular, the control segment may be out of date, so we
+ * can't rely on it to figure out what to remove. However, since we know
+ * what directory contains the files we used as shared memory, we can simply
+ * scan the directory and blow everything away that shouldn't be there.
+ */
+static void
+dsm_cleanup_for_mmap(void)
+{
+ DIR *dir;
+ struct dirent *dent;
+
+ /* Open the directory; can't use AllocateDir in postmaster. */
+ if ((dir = opendir(PG_DYNSHMEM_DIR)) == NULL)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m",
+ PG_DYNSHMEM_DIR)));
+
+ /* Scan for something with a name of the correct format. */
+ while ((dent = readdir(dir)) != NULL)
+ {
+ if (strncmp(dent->d_name, PG_DYNSHMEM_MMAP_FILE_PREFIX,
+ strlen(PG_DYNSHMEM_MMAP_FILE_PREFIX)) == 0)
+ {
+ char buf[MAXPGPATH];
+
+ snprintf(buf, MAXPGPATH, PG_DYNSHMEM_DIR "/%s", dent->d_name);
+
+ elog(DEBUG2, "removing file \"%s\"", buf);
+
+ /* We found a matching file, so remove it. */
+ if (unlink(buf) != 0)
+ {
+ int save_errno;
+
+ save_errno = errno;
+ closedir(dir);
+ errno = save_errno;
+
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", buf)));
+ }
+ }
+ }
+
+ /* Cleanup complete. */
+ closedir(dir);
+}
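The scan-and-unlink loop above boils down to a readdir() prefix filter. A standalone sketch of that pattern (directory path, prefix, and function name here are hypothetical, not the patch's own):

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Remove every file in "dirpath" whose name starts with "prefix".
 * Returns the number of files removed, or -1 if the directory
 * cannot be opened. */
static int
remove_by_prefix(const char *dirpath, const char *prefix)
{
    DIR *dir = opendir(dirpath);
    struct dirent *dent;
    size_t plen = strlen(prefix);
    int removed = 0;

    if (dir == NULL)
        return -1;
    while ((dent = readdir(dir)) != NULL)
    {
        if (strncmp(dent->d_name, prefix, plen) == 0)
        {
            char buf[4096];

            snprintf(buf, sizeof(buf), "%s/%s", dirpath, dent->d_name);
            if (unlink(buf) == 0)
                removed++;
        }
    }
    closedir(dir);
    return removed;
}
```

Note that "." and ".." never match a non-empty prefix, so they are skipped implicitly.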
+
+/*
+ * Read and parse the state file.
+ *
+ * If the state file is empty or the contents are garbled, it probably means
+ * that the operating system rebooted before the data written by the previous
+ * postmaster made it to disk. In that case, we can just ignore it; any shared
+ * memory from before the reboot should be gone anyway.
+ */
+static bool
+dsm_read_state_file(dsm_handle *h)
+{
+ int statefd;
+ char statebuf[PG_DYNSHMEM_STATE_BUFSIZ];
+ int nbytes = 0;
+ char *endptr,
+ *s;
+ dsm_handle handle;
+
+ /* Read the state file to get the ID of the old control segment. */
+ statefd = open(PG_DYNSHMEM_STATE_FILE, O_RDONLY, 0);
+ if (statefd < 0)
+ {
+ if (errno == ENOENT)
+ return false;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m",
+ PG_DYNSHMEM_STATE_FILE)));
+ }
+ nbytes = read(statefd, statebuf, PG_DYNSHMEM_STATE_BUFSIZ - 1);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ PG_DYNSHMEM_STATE_FILE)));
+ /* make sure buffer is NUL terminated */
+ statebuf[nbytes] = '\0';
+ close(statefd);
+
+ /*
+ * We expect to find the handle of the old control segment here,
+ * on a line by itself.
+ */
+ handle = strtoul(statebuf, &endptr, 10);
+ for (s = endptr; *s == ' ' || *s == '\t'; ++s)
+ ;
+ if (*s != '\n' && *s != '\0')
+ return false;
+
+ /* Looks good. */
+ *h = handle;
+ return true;
+}
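The parse step accepts a single unsigned decimal followed only by optional spaces or tabs and a newline; anything else is treated as garbled. That validation can be exercised in isolation (helper name hypothetical):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Return true iff "buf" holds an unsigned decimal number on a line by
 * itself (trailing spaces/tabs allowed), storing the value in *out. */
static bool
parse_state_line(const char *buf, unsigned long *out)
{
    char *endptr;
    unsigned long handle = strtoul(buf, &endptr, 10);
    const char *s;

    /* Skip trailing horizontal whitespace. */
    for (s = endptr; *s == ' ' || *s == '\t'; ++s)
        ;
    if (*s != '\n' && *s != '\0')
        return false;
    *out = handle;
    return true;
}
```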
+
+/*
+ * Write our control segment handle to the state file, so that if the
+ * postmaster is killed without running its on_shmem_exit hooks, the
+ * next postmaster can clean things up after restart.
+ */
+static void
+dsm_write_state_file(dsm_handle h)
+{
+ int statefd;
+ char statebuf[PG_DYNSHMEM_STATE_BUFSIZ];
+ int nbytes;
+
+ /* Create or truncate the file. */
+ statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);
+ if (statefd < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m",
+ PG_DYNSHMEM_NEW_STATE_FILE)));
+
+ /* Write contents. */
+ snprintf(statebuf, PG_DYNSHMEM_STATE_BUFSIZ, "%lu\n",
+ (unsigned long) dsm_control_handle);
+ nbytes = strlen(statebuf);
+ if (write(statefd, statebuf, nbytes) != nbytes)
+ {
+ if (errno == 0)
+ errno = ENOSPC; /* if no error signalled, assume no space */
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ PG_DYNSHMEM_NEW_STATE_FILE)));
+ }
+
+ /* Close file. */
+ close(statefd);
+
+ /*
+ * Atomically rename file into place, so that no one ever sees a partially
+ * written state file.
+ */
+ if (rename(PG_DYNSHMEM_NEW_STATE_FILE, PG_DYNSHMEM_STATE_FILE) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename file \"%s\": %m",
+ PG_DYNSHMEM_NEW_STATE_FILE)));
+}
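The write-temp-then-rename trick used above is a general pattern for crash-safe state files: on POSIX systems, rename() within one filesystem atomically replaces the target, so a reader either sees the old complete file or the new one, never a partial write. A minimal standalone version (file names and function name are hypothetical):

```c
#include <stdio.h>

/* Write "handle" to "path" atomically: write a temporary file first,
 * then rename() it into place, so a concurrent reader never observes
 * a partially written state file. Returns 0 on success, -1 on error. */
static int
write_state_atomically(const char *path, const char *tmppath,
                       unsigned long handle)
{
    FILE *fp = fopen(tmppath, "w");

    if (fp == NULL)
        return -1;
    if (fprintf(fp, "%lu\n", handle) < 0 || fclose(fp) != 0)
        return -1;
    /* rename() within one filesystem is atomic on POSIX systems. */
    return rename(tmppath, path) == 0 ? 0 : -1;
}
```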
+
+/*
+ * At shutdown time, we iterate over the control segment and remove all
+ * remaining dynamic shared memory segments. We avoid throwing errors here;
+ * the postmaster is shutting down either way, and this is just non-critical
+ * resource cleanup.
+ */
+static void
+dsm_postmaster_shutdown(int code, Datum arg)
+{
+ uint32 nitems;
+ uint32 i;
+ void *dsm_control_address;
+ void *junk_mapped_address = NULL;
+ void *junk_impl_private = NULL;
+ uint64 junk_mapped_size = 0;
+
+ /* If dynamic shared memory is disabled, there's nothing to do. */
+ if (dynamic_shared_memory_type == DSM_IMPL_NONE)
+ return;
+
+ /*
+ * If some other backend exited uncleanly, it might have corrupted the
+ * control segment while it was dying. In that case, we warn and ignore
+ * the contents of the control segment. This may end up leaving behind
+ * stray shared memory segments, but there's not much we can do about
+ * that if the metadata is gone.
+ */
+ if (!dsm_control_segment_sane(dsm_control, dsm_control_mapped_size))
+ {
+ ereport(LOG,
+ (errmsg("dynamic shared memory control segment is corrupt")));
+ return;
+ }
+ nitems = dsm_control->nitems;
+
+ /* Remove any remaining segments. */
+ for (i = 0; i < nitems; ++i)
+ {
+ dsm_handle handle;
+
+ /* If the reference count is 0, the slot is actually unused. */
+ if (dsm_control->item[i].refcnt == 0)
+ continue;
+
+ /* Log debugging information. */
+ handle = dsm_control->item[i].handle;
+ elog(DEBUG2, "cleaning up orphaned dynamic shared memory with ID %lu",
+ (unsigned long) handle);
+
+ /* Destroy the segment. */
+ dsm_impl_op(DSM_OP_DESTROY, handle, 0, &junk_impl_private,
+ &junk_mapped_address, &junk_mapped_size, LOG);
+ }
+
+ /* Remove the control segment itself. */
+ elog(DEBUG2,
+ "cleaning up dynamic shared memory control segment with ID %lu",
+ (unsigned long) dsm_control_handle);
+ dsm_control_address = dsm_control;
+ dsm_impl_op(DSM_OP_DESTROY, dsm_control_handle, 0,
+ &dsm_control_impl_private, &dsm_control_address,
+ &dsm_control_mapped_size, LOG);
+ dsm_control = dsm_control_address;
+
+ /* And, finally, remove the state file. */
+ if (unlink(PG_DYNSHMEM_STATE_FILE) < 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not unlink file \"%s\": %m",
+ PG_DYNSHMEM_STATE_FILE)));
+}
+
+/*
+ * Prepare this backend for dynamic shared memory usage. Under EXEC_BACKEND,
+ * we must reread the state file and map the control segment; in other cases,
+ * we'll have inherited the postmaster's mapping and global variables.
+ */
+static void
+dsm_backend_startup(void)
+{
+ /* If dynamic shared memory is disabled, reject this. */
+ if (dynamic_shared_memory_type == DSM_IMPL_NONE)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("dynamic shared memory is disabled"),
+ errhint("Set dynamic_shared_memory_type to a value other than \"none\".")));
+
+#ifdef EXEC_BACKEND
+ {
+ dsm_handle control_handle;
+ void *control_address = NULL;
+
+ /* Read the control segment information from the state file. */
+ if (!dsm_read_state_file(&control_handle))
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("could not parse dynamic shared memory state file")));
+
+ /* Attach control segment. */
+ dsm_impl_op(DSM_OP_ATTACH, control_handle, 0,
+ &dsm_control_impl_private, &control_address,
+ &dsm_control_mapped_size, ERROR);
+ dsm_control_handle = control_handle;
+ dsm_control = control_address;
+ /* If control segment doesn't look sane, something is badly wrong. */
+ if (!dsm_control_segment_sane(dsm_control, dsm_control_mapped_size))
+ {
+ dsm_impl_op(DSM_OP_DETACH, control_handle, 0,
+ &dsm_control_impl_private, &control_address,
+ &dsm_control_mapped_size, WARNING);
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("dynamic shared memory control segment is not valid")));
+ }
+ }
+#endif
+
+ /* Arrange to detach segments on exit. */
+ on_shmem_exit(dsm_backend_shutdown, 0);
+
+ dsm_init_done = true;
+}
+
+/*
+ * Create a new dynamic shared memory segment.
+ */
+dsm_segment *
+dsm_create(uint64 size)
+{
+ dsm_segment *seg = dsm_create_descriptor();
+ uint32 i;
+ uint32 nitems;
+
+ /* Unsafe in postmaster (and pointless in a stand-alone backend). */
+ Assert(IsUnderPostmaster);
+
+ if (!dsm_init_done)
+ dsm_backend_startup();
+
+ /* Loop until we find an unused segment identifier. */
+ for (;;)
+ {
+ Assert(seg->mapped_address == NULL && seg->mapped_size == 0);
+ seg->handle = random();
+ if (dsm_impl_op(DSM_OP_CREATE, seg->handle, size, &seg->impl_private,
+ &seg->mapped_address, &seg->mapped_size, ERROR))
+ break;
+ }
+
+ /* Lock the control segment so we can register the new segment. */
+ LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
+
+ /* Search the control segment for an unused slot. */
+ nitems = dsm_control->nitems;
+ for (i = 0; i < nitems; ++i)
+ {
+ if (dsm_control->item[i].refcnt == 0)
+ {
+ dsm_control->item[i].handle = seg->handle;
+ /* refcnt of 1 triggers destruction, so start at 2 */
+ dsm_control->item[i].refcnt = 2;
+ seg->control_slot = i;
+ LWLockRelease(DynamicSharedMemoryControlLock);
+ return seg;
+ }
+ }
+
+ /* Verify that we can support an additional mapping. */
+ if (nitems >= dsm_control->maxitems)
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+ errmsg("too many dynamic shared memory segments")));
+
+ /* Enter the handle into a new array slot. */
+ dsm_control->item[nitems].handle = seg->handle;
+ /* refcnt of 1 triggers destruction, so start at 2 */
+ dsm_control->item[nitems].refcnt = 2;
+ seg->control_slot = nitems;
+ dsm_control->nitems++;
+ LWLockRelease(DynamicSharedMemoryControlLock);
+
+ return seg;
+}
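The "refcnt of 1 triggers destruction, so start at 2" convention gives each control slot three states: 0 means the slot is free, 1 means the segment is being torn down, and 2 or more counts the creator plus attached backends. A toy model of those rules, independent of the patch's actual structs (names hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a control-segment slot's reference count:
 * 0 = unused, 1 = destruction pending, >= 2 = creator + attachers. */
static bool
slot_attach(uint32_t *refcnt)
{
    if (*refcnt < 2)            /* free or being torn down: no attach */
        return false;
    ++*refcnt;
    return true;
}

/* Returns true when the caller should destroy the underlying segment. */
static bool
slot_detach(uint32_t *refcnt)
{
    --*refcnt;
    if (*refcnt == 1)           /* last user gone */
    {
        *refcnt = 0;            /* destruction done: mark slot free */
        return true;
    }
    return false;
}
```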
+
+/*
+ * Attach a dynamic shared memory segment.
+ *
+ * See comments for dsm_segment_handle() for an explanation of how this
+ * is intended to be used.
+ *
+ * This function will return NULL if the segment isn't known to the system.
+ * This can happen if we're asked to attach the segment, but then everyone
+ * else detaches it (causing it to be destroyed) before we get around to
+ * attaching it.
+ */
+dsm_segment *
+dsm_attach(dsm_handle h)
+{
+ dsm_segment *seg;
+ dlist_iter iter;
+ uint32 i;
+ uint32 nitems;
+
+ /* Unsafe in postmaster (and pointless in a stand-alone backend). */
+ Assert(IsUnderPostmaster);
+
+ if (!dsm_init_done)
+ dsm_backend_startup();
+
+ /*
+ * Since this is just a debugging cross-check, we could leave it out
+ * altogether, or include it only in assert-enabled builds. But since
+ * the list of attached segments should normally be very short, let's
+ * always include it for now.
+ *
+ * If you're hitting this error, you probably want to attempt to
+ * find an existing mapping via dsm_find_mapping() before calling
+ * dsm_attach() to create a new one.
+ */
+ dlist_foreach(iter, &dsm_segment_list)
+ {
+ seg = dlist_container(dsm_segment, node, iter.cur);
+ if (seg->handle == h)
+ elog(ERROR, "cannot attach the same segment more than once");
+ }
+
+ /* Create a new segment descriptor. */
+ seg = dsm_create_descriptor();
+ seg->handle = h;
+
+ /* Bump reference count for this segment in shared memory. */
+ LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
+ nitems = dsm_control->nitems;
+ for (i = 0; i < nitems; ++i)
+ {
+ /* If the reference count is 0, the slot is actually unused. */
+ if (dsm_control->item[i].refcnt == 0)
+ continue;
+
+ /*
+ * If the reference count is 1, the slot is still in use, but the
+ * segment is in the process of going away. Treat that as if we
+ * didn't find a match.
+ */
+ if (dsm_control->item[i].refcnt == 1)
+ break;
+
+ /* Otherwise, if the descriptor matches, we've found a match. */
+ if (dsm_control->item[i].handle == seg->handle)
+ {
+ dsm_control->item[i].refcnt++;
+ seg->control_slot = i;
+ break;
+ }
+ }
+ LWLockRelease(DynamicSharedMemoryControlLock);
+
+ /*
+ * If we didn't find the handle we're looking for in the control
+ * segment, it probably means that everyone else who had it mapped,
+ * including the original creator, died before we got to this point.
+ * It's up to the caller to decide what to do about that.
+ */
+ if (seg->control_slot == INVALID_CONTROL_SLOT)
+ {
+ dsm_detach(seg);
+ return NULL;
+ }
+
+ /* Here's where we actually try to map the segment. */
+ dsm_impl_op(DSM_OP_ATTACH, seg->handle, 0, &seg->impl_private,
+ &seg->mapped_address, &seg->mapped_size, ERROR);
+
+ return seg;
+}
+
+/*
+ * At backend shutdown time, detach any segments that are still attached.
+ */
+static void
+dsm_backend_shutdown(int code, Datum arg)
+{
+ while (!dlist_is_empty(&dsm_segment_list))
+ {
+ dsm_segment *seg;
+
+ seg = dlist_head_element(dsm_segment, node, &dsm_segment_list);
+ dsm_detach(seg);
+ }
+}
+
+/*
+ * Resize an existing shared memory segment.
+ *
+ * This may cause the shared memory segment to be remapped at a different
+ * address. For the caller's convenience, we return the mapped address.
+ */
+void *
+dsm_resize(dsm_segment *seg, uint64 size)
+{
+ Assert(seg->control_slot != INVALID_CONTROL_SLOT);
+ dsm_impl_op(DSM_OP_RESIZE, seg->handle, size, &seg->impl_private,
+ &seg->mapped_address, &seg->mapped_size, ERROR);
+ return seg->mapped_address;
+}
+
+/*
+ * Remap an existing shared memory segment.
+ *
+ * This is intended to be used when some other process has extended the
+ * mapping using dsm_resize(), but we've still only got the initial
+ * portion mapped. Since this might change the address at which the
+ * segment is mapped, we return the new mapped address.
+ */
+void *
+dsm_remap(dsm_segment *seg)
+{
+ if (!dsm_impl_can_resize())
+ return seg->mapped_address;
+
+ dsm_impl_op(DSM_OP_ATTACH, seg->handle, 0, &seg->impl_private,
+ &seg->mapped_address, &seg->mapped_size, ERROR);
+
+ return seg->mapped_address;
+}
+
+/*
+ * Detach from a shared memory segment, destroying the segment if we
+ * remove the last reference.
+ *
+ * This function should never fail. It will often be invoked when aborting
+ * a transaction, and a further error won't serve any purpose. It's not a
+ * complete disaster if we fail to unmap or destroy the segment; it means a
+ * resource leak, but that doesn't necessarily preclude further operations.
+ */
+void
+dsm_detach(dsm_segment *seg)
+{
+ /*
+ * Try to remove the mapping, if one exists. Normally, there will be,
+ * but maybe not, if we failed partway through a create or attach
+ * operation. We remove the mapping before decrementing the reference
+ * count so that the process that sees a zero reference count can be
+ * certain that no remaining mappings exist. Even if this fails, we
+ * pretend that it works, because retrying is likely to fail in the
+ * same way.
+ */
+ if (seg->mapped_address != NULL)
+ {
+ dsm_impl_op(DSM_OP_DETACH, seg->handle, 0, &seg->impl_private,
+ &seg->mapped_address, &seg->mapped_size, WARNING);
+ seg->impl_private = NULL;
+ seg->mapped_address = NULL;
+ seg->mapped_size = 0;
+ }
+
+ /* Reduce reference count, if we previously increased it. */
+ if (seg->control_slot != INVALID_CONTROL_SLOT)
+ {
+ uint32 refcnt;
+ uint32 control_slot = seg->control_slot;
+
+ LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
+ Assert(dsm_control->item[control_slot].handle == seg->handle);
+ Assert(dsm_control->item[control_slot].refcnt > 1);
+ refcnt = --dsm_control->item[control_slot].refcnt;
+ seg->control_slot = INVALID_CONTROL_SLOT;
+ LWLockRelease(DynamicSharedMemoryControlLock);
+
+ /* If new reference count is 1, try to destroy the segment. */
+ if (refcnt == 1)
+ {
+ /*
+ * If we fail to destroy the segment here, or are killed before
+ * we finish doing so, the reference count will remain at 1, which
+ * will mean that nobody else can attach to the segment. At
+ * postmaster shutdown time, or when a new postmaster is started
+ * after a hard kill, another attempt will be made to remove the
+ * segment.
+ *
+ * The main case we're worried about here is being killed by
+ * a signal before we can finish removing the segment. In that
+ * case, it's important to be sure that the segment still gets
+ * removed. If we actually fail to remove the segment for some
+ * other reason, the postmaster may not have any better luck than
+ * we did. There's not much we can do about that, though.
+ */
+ if (dsm_impl_op(DSM_OP_DESTROY, seg->handle, 0, &seg->impl_private,
+ &seg->mapped_address, &seg->mapped_size, WARNING))
+ {
+ LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
+ Assert(dsm_control->item[control_slot].handle == seg->handle);
+ Assert(dsm_control->item[control_slot].refcnt == 1);
+ dsm_control->item[control_slot].refcnt = 0;
+ LWLockRelease(DynamicSharedMemoryControlLock);
+ }
+ }
+ }
+
+ /* Clean up our remaining backend-private data structures. */
+ if (seg->resowner != NULL)
+ ResourceOwnerForgetDSM(seg->resowner, seg);
+ dlist_delete(&seg->node);
+ pfree(seg);
+}
+
+/*
+ * Keep a dynamic shared memory mapping until end of session.
+ *
+ * By default, mappings are owned by the current resource owner, which
+ * typically means they stick around for the duration of the current query
+ * only.
+ */
+void
+dsm_keep_mapping(dsm_segment *seg)
+{
+ if (seg->resowner != NULL)
+ {
+ ResourceOwnerForgetDSM(seg->resowner, seg);
+ seg->resowner = NULL;
+ }
+}
+
+/*
+ * Find an existing mapping for a shared memory segment, if there is one.
+ */
+dsm_segment *
+dsm_find_mapping(dsm_handle h)
+{
+ dlist_iter iter;
+ dsm_segment *seg;
+
+ dlist_foreach(iter, &dsm_segment_list)
+ {
+ seg = dlist_container(dsm_segment, node, iter.cur);
+ if (seg->handle == h)
+ return seg;
+ }
+
+ return NULL;
+}
+
+/*
+ * Get the address at which a dynamic shared memory segment is mapped.
+ */
+void *
+dsm_segment_address(dsm_segment *seg)
+{
+ Assert(seg->mapped_address != NULL);
+ return seg->mapped_address;
+}
+
+/*
+ * Get the size of a mapping.
+ */
+uint64
+dsm_segment_map_length(dsm_segment *seg)
+{
+ Assert(seg->mapped_address != NULL);
+ return seg->mapped_size;
+}
+
+/*
+ * Get a handle for a mapping.
+ *
+ * To establish communication via dynamic shared memory between two backends,
+ * one of them should first call dsm_create() to establish a new shared
+ * memory mapping. That process should then call dsm_segment_handle() to
+ * obtain a handle for the mapping, and pass that handle to the
+ * coordinating backend via some means (e.g. bgw_main_arg, or via the
+ * main shared memory segment). The recipient, once in possession of the
+ * handle, should call dsm_attach().
+ */
+dsm_handle
+dsm_segment_handle(dsm_segment *seg)
+{
+ return seg->handle;
+}
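In outline, the handshake described above looks like this (an API-usage sketch, not compilable outside a backend; the segment size is arbitrary):

```c
/* In the coordinating backend: */
dsm_segment *seg = dsm_create(65536);
dsm_handle handle = dsm_segment_handle(seg);
/* ... pass "handle" to the worker, e.g. via bgw_main_arg ... */

/* In the worker: */
dsm_segment *wseg = dsm_attach(handle);
if (wseg == NULL)
    ereport(ERROR, ...);    /* creator went away before we attached */
void *addr = dsm_segment_address(wseg);
```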
+
+/*
+ * Create a segment descriptor.
+ */
+static dsm_segment *
+dsm_create_descriptor(void)
+{
+ dsm_segment *seg;
+
+ ResourceOwnerEnlargeDSMs(CurrentResourceOwner);
+
+ seg = MemoryContextAlloc(TopMemoryContext, sizeof(dsm_segment));
+ dlist_push_head(&dsm_segment_list, &seg->node);
+
+ /* seg->handle must be initialized by the caller */
+ seg->control_slot = INVALID_CONTROL_SLOT;
+ seg->impl_private = NULL;
+ seg->mapped_address = NULL;
+ seg->mapped_size = 0;
+
+ seg->resowner = CurrentResourceOwner;
+ ResourceOwnerRememberDSM(CurrentResourceOwner, seg);
+
+ return seg;
+}
+
+/*
+ * Sanity check a control segment.
+ *
+ * The goal here isn't to detect everything that could possibly be wrong with
+ * the control segment; there's not enough information for that. Rather, the
+ * goal is to make sure that someone can iterate over the items in the segment
+ * without overrunning the end of the mapping and crashing. We also check
+ * the magic number since, if that's messed up, this may not even be one of
+ * our segments at all.
+ */
+static bool
+dsm_control_segment_sane(dsm_control_header *control, uint64 mapped_size)
+{
+ if (mapped_size < offsetof(dsm_control_header, item))
+ return false; /* Mapped size too short to read header. */
+ if (control->magic != PG_DYNSHMEM_CONTROL_MAGIC)
+ return false; /* Magic number doesn't match. */
+ if (dsm_control_bytes_needed(control->maxitems) > mapped_size)
+ return false; /* Max item count won't fit in map. */
+ if (control->nitems > control->maxitems)
+ return false; /* Overfull. */
+ return true;
+}
+
+/*
+ * Compute the number of control-segment bytes needed to store a given
+ * number of items.
+ */
+static uint64
+dsm_control_bytes_needed(uint32 nitems)
+{
+ return offsetof(dsm_control_header, item)
+ + sizeof(dsm_control_item) * (uint64) nitems;
+}
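The sizing arithmetic just above follows the usual flexible-array-member pattern: header bytes plus one item per slot, with the item count widened before the multiply so large counts cannot overflow 32-bit arithmetic. A standalone illustration with simplified, hypothetical types:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint32_t handle;
    uint32_t refcnt;
} toy_item;

typedef struct
{
    uint32_t magic;
    uint32_t nitems;
    uint32_t maxitems;
    toy_item item[];            /* flexible array member */
} toy_header;

/* Bytes needed for a header plus "nitems" slots; widen to 64 bits
 * before multiplying so the computation can't overflow. */
static uint64_t
toy_bytes_needed(uint32_t nitems)
{
    return offsetof(toy_header, item)
        + sizeof(toy_item) * (uint64_t) nitems;
}
```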
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
new file mode 100644
index 0000000..8005fd9
--- /dev/null
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -0,0 +1,986 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsm_impl.c
+ * manage dynamic shared memory segments
+ *
+ * This file provides low-level APIs for creating and destroying shared
+ * memory segments using several different possible techniques. We refer
+ * to these segments as dynamic because they can be created, altered, and
+ * destroyed at any point during the server life cycle. This is unlike
+ * the main shared memory segment, of which there is always exactly one
+ * and which is always mapped at a fixed address in every PostgreSQL
+ * background process.
+ *
+ * Because not all systems provide the same primitives in this area, nor
+ * do all primitives behave the same way on all systems, we provide
+ * several implementations of this facility. Many systems implement
+ * POSIX shared memory (shm_open etc.), which is well-suited to our needs
+ * in this area, with the exception that shared memory identifiers live
+ * in a flat system-wide namespace, raising the uncomfortable prospect of
+ * name collisions with other processes (including other copies of
+ * PostgreSQL) running on the same system. Some systems only support
+ * the older System V shared memory interface (shmget etc.) which is
+ * also usable; however, the default allocation limits are often quite
+ * small, and the namespace is even more restricted.
+ *
+ * We also provide an mmap-based shared memory implementation. This may
+ * be useful on systems that provide shared memory via a special-purpose
+ * filesystem; by opting for this implementation, the user can even
+ * control precisely where their shared memory segments are placed. It
+ * can also be used as a fallback for systems where shm_open and shmget
+ * are not available or can't be used for some reason. Of course,
+ * mapping a file residing on an actual spinning disk is a fairly poor
+ * approximation for shared memory because writeback may hurt performance
+ * substantially, but there should be few systems where we must make do
+ * with such poor tools.
+ *
+ * As ever, Windows requires its own implementation.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/ipc/dsm_impl.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <fcntl.h>
+#include <string.h>
+#include <unistd.h>
+#ifndef WIN32
+#include <sys/mman.h>
+#endif
+#include <sys/stat.h>
+#ifdef HAVE_SYS_IPC_H
+#include <sys/ipc.h>
+#endif
+#ifdef HAVE_SYS_SHM_H
+#include <sys/shm.h>
+#endif
+
+#include "portability/mem.h"
+#include "storage/dsm_impl.h"
+#include "storage/fd.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+
+#ifdef USE_DSM_POSIX
+static bool dsm_impl_posix(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address,
+ uint64 *mapped_size, int elevel);
+#endif
+#ifdef USE_DSM_SYSV
+static bool dsm_impl_sysv(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address,
+ uint64 *mapped_size, int elevel);
+#endif
+#ifdef USE_DSM_WINDOWS
+static bool dsm_impl_windows(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address,
+ uint64 *mapped_size, int elevel);
+#endif
+#ifdef USE_DSM_MMAP
+static bool dsm_impl_mmap(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address,
+ uint64 *mapped_size, int elevel);
+#endif
+static int errcode_for_dynamic_shared_memory(void);
+
+const struct config_enum_entry dynamic_shared_memory_options[] = {
+#ifdef USE_DSM_POSIX
+ { "posix", DSM_IMPL_POSIX, false},
+#endif
+#ifdef USE_DSM_SYSV
+ { "sysv", DSM_IMPL_SYSV, false},
+#endif
+#ifdef USE_DSM_WINDOWS
+ { "windows", DSM_IMPL_WINDOWS, false},
+#endif
+#ifdef USE_DSM_MMAP
+ { "mmap", DSM_IMPL_MMAP, false},
+#endif
+ { "none", DSM_IMPL_NONE, false},
+ {NULL, 0, false}
+};
+
+/* Implementation selector. */
+int dynamic_shared_memory_type;
+
+/* Size of buffer to be used for zero-filling. */
+#define ZBUFFER_SIZE 8192
+
+/*------
+ * Perform a low-level shared memory operation in a platform-specific way,
+ * as dictated by the selected implementation. Each implementation is
+ * required to implement the following primitives.
+ *
+ * DSM_OP_CREATE. Create a segment whose size is the request_size and
+ * map it.
+ *
+ * DSM_OP_ATTACH. Map the segment, whose size must be the request_size.
+ * The segment may already be mapped; any existing mapping should be removed
+ * before creating a new one.
+ *
+ * DSM_OP_DETACH. Unmap the segment.
+ *
+ * DSM_OP_RESIZE. Resize the segment to the given request_size and
+ * remap the segment at that new size.
+ *
+ * DSM_OP_DESTROY. Unmap the segment, if it is mapped. Destroy the
+ * segment.
+ *
+ * Arguments:
+ * op: The operation to be performed.
+ * handle: The handle of an existing object, or for DSM_OP_CREATE, the
+ * new handle the caller wants created.
+ * request_size: For DSM_OP_CREATE, the requested size. For DSM_OP_RESIZE,
+ * the new size. Otherwise, 0.
+ * impl_private: Private, implementation-specific data. Will be a pointer
+ * to NULL for the first operation on a shared memory segment within this
+ * backend; thereafter, it will point to the value to which it was set
+ * on the previous call.
+ * mapped_address: Pointer to start of current mapping; pointer to NULL
+ * if none. Updated with new mapping address.
+ * mapped_size: Pointer to size of current mapping; pointer to 0 if none.
+ * Updated with new mapped size.
+ * elevel: Level at which to log errors.
+ *
+ * Return value: true on success, false on failure. When false is returned,
+ * a message should first be logged at the specified elevel, except in the
+ * case where DSM_OP_CREATE experiences a name collision, in which case
+ * false should be returned silently.
+ *-----
+ */
+bool
+dsm_impl_op(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address, uint64 *mapped_size,
+ int elevel)
+{
+ Assert(op == DSM_OP_CREATE || op == DSM_OP_RESIZE || request_size == 0);
+ Assert((op != DSM_OP_CREATE && op != DSM_OP_ATTACH) ||
+ (*mapped_address == NULL && *mapped_size == 0));
+
+ if (request_size > (size_t) -1)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("requested shared memory size overflows size_t")));
+
+ switch (dynamic_shared_memory_type)
+ {
+#ifdef USE_DSM_POSIX
+ case DSM_IMPL_POSIX:
+ return dsm_impl_posix(op, handle, request_size, impl_private,
+ mapped_address, mapped_size, elevel);
+#endif
+#ifdef USE_DSM_SYSV
+ case DSM_IMPL_SYSV:
+ return dsm_impl_sysv(op, handle, request_size, impl_private,
+ mapped_address, mapped_size, elevel);
+#endif
+#ifdef USE_DSM_WINDOWS
+ case DSM_IMPL_WINDOWS:
+ return dsm_impl_windows(op, handle, request_size, impl_private,
+ mapped_address, mapped_size, elevel);
+#endif
+#ifdef USE_DSM_MMAP
+ case DSM_IMPL_MMAP:
+ return dsm_impl_mmap(op, handle, request_size, impl_private,
+ mapped_address, mapped_size, elevel);
+#endif
+ }
+ elog(ERROR, "unexpected dynamic shared memory type: %d",
+ dynamic_shared_memory_type);
+ return false; /* not reached */
+}
+
+/*
+ * Does the current dynamic shared memory implementation support resizing
+ * segments? (The answer here could be platform-dependent in the future,
+ * since AIX allows shmctl(shmid, SHM_RESIZE, &buffer), though you apparently
+ * can't resize segments to anything larger than 256MB that way. For now,
+ * we keep it simple.)
+ */
+bool
+dsm_impl_can_resize(void)
+{
+ switch (dynamic_shared_memory_type)
+ {
+ case DSM_IMPL_NONE:
+ return false;
+ case DSM_IMPL_SYSV:
+ return false;
+ case DSM_IMPL_WINDOWS:
+ return false;
+ default:
+ return true;
+ }
+}
+
+#ifdef USE_DSM_POSIX
+/*
+ * Operating system primitives to support POSIX shared memory.
+ *
+ * POSIX shared memory segments are created and attached using shm_open()
+ * and shm_unlink(); other operations, such as sizing or mapping the
+ * segment, are performed as if the shared memory segments were files.
+ *
+ * Indeed, on some platforms, they may be implemented that way. While
+ * POSIX shared memory segments seem intended to exist in a flat namespace,
+ * some operating systems may implement them as files, even going so far
+ * as to treat a request for /xyz as a request to create a file by that name
+ * in the root directory. Users of such broken platforms should select
+ * a different shared memory implementation.
+ */
+static bool
+dsm_impl_posix(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address, uint64 *mapped_size,
+ int elevel)
+{
+ char name[64];
+ int flags;
+ int fd;
+ char *address;
+
+ snprintf(name, 64, "/PostgreSQL.%lu", (unsigned long) handle);
+
+ /* Handle teardown cases. */
+ if (op == DSM_OP_DETACH || op == DSM_OP_DESTROY)
+ {
+ if (*mapped_address != NULL
+ && munmap(*mapped_address, *mapped_size) != 0)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not unmap shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = NULL;
+ *mapped_size = 0;
+ if (op == DSM_OP_DESTROY && shm_unlink(name) != 0)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not remove shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ return true;
+ }
+
+ /*
+ * Create new segment or open an existing one for attach or resize.
+ *
+ * Even though we're not going through fd.c, we should be safe against
+ * running out of file descriptors, because of NUM_RESERVED_FDS. We're
+ * only opening one extra descriptor here, and we'll close it before
+ * returning.
+ */
+ flags = O_RDWR | (op == DSM_OP_CREATE ? O_CREAT | O_EXCL : 0);
+ if ((fd = shm_open(name, flags, 0600)) == -1)
+ {
+ if (errno != EEXIST)
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not open shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+
+ /*
+ * If we're attaching the segment, determine the current size; if we are
+ * creating or resizing the segment, set the size to the requested value.
+ */
+ if (op == DSM_OP_ATTACH)
+ {
+ struct stat st;
+
+ if (fstat(fd, &st) != 0)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ close(fd);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not stat shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ request_size = st.st_size;
+ }
+ else if (*mapped_size != request_size && ftruncate(fd, request_size))
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ close(fd);
+ if (op == DSM_OP_CREATE)
+ shm_unlink(name);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not resize shared memory segment \"%s\" to " UINT64_FORMAT " bytes: %m",
+ name, request_size)));
+ return false;
+ }
+
+ /*
+ * If we're reattaching or resizing, we must remove any existing mapping,
+ * unless we've already got the right thing mapped.
+ */
+ if (*mapped_address != NULL)
+ {
+ if (*mapped_size == request_size)
+ return true;
+ if (munmap(*mapped_address, *mapped_size) != 0)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ close(fd);
+ if (op == DSM_OP_CREATE)
+ shm_unlink(name);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not unmap shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = NULL;
+ *mapped_size = 0;
+ }
+
+ /* Map it. */
+ address = mmap(NULL, request_size, PROT_READ|PROT_WRITE,
+ MAP_SHARED|MAP_HASSEMAPHORE, fd, 0);
+ if (address == MAP_FAILED)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ close(fd);
+ if (op == DSM_OP_CREATE)
+ shm_unlink(name);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not map shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = address;
+ *mapped_size = request_size;
+ close(fd);
+
+ return true;
+}
+#endif
+
+#ifdef USE_DSM_SYSV
+/*
+ * Operating system primitives to support System V shared memory.
+ *
+ * System V shared memory segments are manipulated using shmget(), shmat(),
+ * shmdt(), and shmctl(). There's no portable way to resize such
+ * segments. As the default allocation limits for System V shared memory
+ * are usually quite low, the POSIX facilities may be preferable; but
+ * those are not supported everywhere.
+ */
+static bool
+dsm_impl_sysv(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address, uint64 *mapped_size,
+ int elevel)
+{
+ key_t key;
+ int ident;
+ char *address;
+ char name[64];
+ int *ident_cache;
+
+ /* Resize is not supported for System V shared memory. */
+ if (op == DSM_OP_RESIZE)
+ {
+ elog(elevel, "System V shared memory segments cannot be resized");
+ return false;
+ }
+
+ /* Since resize isn't supported, reattach is a no-op. */
+ if (op == DSM_OP_ATTACH && *mapped_address != NULL)
+ return true;
+
+ /*
+ * POSIX shared memory and mmap-based shared memory identify segments
+ * with names. To avoid needless error message variation, we use the
+ * handle as the name.
+ */
+ snprintf(name, 64, "%lu", (unsigned long) handle);
+
+ /*
+ * The System V shared memory namespace is very restricted; names are
+ * of type key_t, which is expected to be some sort of integer data type,
+ * but not necessarily the same one as dsm_handle. Since we use
+ * dsm_handle to identify shared memory segments across processes, this
+ * might seem like a problem, but it's really not. If dsm_handle is
+ * bigger than key_t, the cast below might truncate away some bits from
+ * the user-provided handle, but it'll truncate exactly the same bits
+ * away in exactly the same fashion every time we use that handle, which
+ * is all that really matters. Conversely, if dsm_handle is smaller than
+ * key_t, we won't use the full range of available key space, but that's
+ * no big deal either.
+ *
+ * We do make sure that the key isn't negative, because that might not
+ * be portable.
+ */
+ key = (key_t) handle;
+ if (key < 1) /* avoid compiler warning if type is unsigned */
+ key = -key;
+
+ /*
+ * There's one special key, IPC_PRIVATE, which can't be used. If we end
+ * up with that value by chance during a create operation, just pretend
+ * it already exists, so that caller will retry. If we run into it
+ * anywhere else, the caller has passed a handle that doesn't correspond
+ * to anything we ever created, which should not happen.
+ */
+ if (key == IPC_PRIVATE)
+ {
+ if (op != DSM_OP_CREATE)
+ elog(elevel, "System V shared memory key may not be IPC_PRIVATE");
+ errno = EEXIST;
+ return false;
+ }
+
+ /*
+ * Before we can do anything with a shared memory segment, we have to
+ * map the shared memory key to a shared memory identifier using shmget().
+ * To avoid repeated lookups, we store the key using impl_private.
+ */
+ if (*impl_private != NULL)
+ {
+ ident_cache = *impl_private;
+ ident = *ident_cache;
+ }
+ else
+ {
+ int flags = IPCProtection;
+ size_t segsize;
+
+ /*
+ * Allocate the memory BEFORE acquiring the resource, so that we don't
+ * leak the resource if memory allocation fails.
+ */
+ ident_cache = MemoryContextAlloc(TopMemoryContext, sizeof(int));
+
+ /*
+ * When using shmget to find an existing segment, we must pass the
+ * size as 0. Passing a non-zero size which is greater than the
+ * actual size will result in EINVAL.
+ */
+ segsize = 0;
+
+ if (op == DSM_OP_CREATE)
+ {
+ flags |= IPC_CREAT | IPC_EXCL;
+ segsize = request_size;
+ }
+
+ if ((ident = shmget(key, segsize, flags)) == -1)
+ {
+ if (errno != EEXIST)
+ {
+ int save_errno = errno;
+ pfree(ident_cache);
+ errno = save_errno;
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not get shared memory segment: %m")));
+ }
+ return false;
+ }
+
+ *ident_cache = ident;
+ *impl_private = ident_cache;
+ }
+
+ /* Handle teardown cases. */
+ if (op == DSM_OP_DETACH || op == DSM_OP_DESTROY)
+ {
+ pfree(ident_cache);
+ *impl_private = NULL;
+ if (*mapped_address != NULL && shmdt(*mapped_address) != 0)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not unmap shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = NULL;
+ *mapped_size = 0;
+ if (op == DSM_OP_DESTROY && shmctl(ident, IPC_RMID, NULL) < 0)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not remove shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ return true;
+ }
+
+ /* If we're attaching it, we must use IPC_STAT to determine the size. */
+ if (op == DSM_OP_ATTACH)
+ {
+ struct shmid_ds shm;
+
+ if (shmctl(ident, IPC_STAT, &shm) != 0)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ if (op == DSM_OP_CREATE)
+ shmctl(ident, IPC_RMID, NULL);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not stat shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ request_size = shm.shm_segsz;
+ }
+
+ /* Map it. */
+ address = shmat(ident, NULL, PG_SHMAT_FLAGS);
+ if (address == (void *) -1)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ if (op == DSM_OP_CREATE)
+ shmctl(ident, IPC_RMID, NULL);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not map shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = address;
+ *mapped_size = request_size;
+
+ return true;
+}
+#endif
+
+#ifdef USE_DSM_WINDOWS
+/*
+ * Operating system primitives to support Windows shared memory.
+ *
+ * The Windows shared memory implementation uses a file mapping, which
+ * can be backed by either a physical file or the system paging file.
+ * The current implementation uses the system paging file, since the
+ * performance and other effects of a physical-file backing are unclear,
+ * and the main shared memory segment on Windows is handled the same way.
+ *
+ * A memory mapping object is a kernel object - they always get deleted when
+ * the last reference to them goes away, either explicitly via a CloseHandle or
+ * when the process containing the reference exits.
+ */
+static bool
+dsm_impl_windows(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address,
+ uint64 *mapped_size, int elevel)
+{
+ char *address;
+ HANDLE hmap;
+ char name[64];
+ MEMORY_BASIC_INFORMATION info;
+
+ /* Resize is not supported for Windows shared memory. */
+ if (op == DSM_OP_RESIZE)
+ {
+ elog(elevel, "Windows shared memory segments cannot be resized");
+ return false;
+ }
+
+ /* Since resize isn't supported, reattach is a no-op. */
+ if (op == DSM_OP_ATTACH && *mapped_address != NULL)
+ return true;
+
+ /*
+ * Storing the shared memory segment in the Global\ namespace would
+ * allow any process running in any session to access the file mapping
+ * object, provided that the caller has the required access rights.
+ * But to avoid the issues encountered with the main shared memory
+ * segment, we use a similar naming convention here. We can change
+ * this once the issue mentioned in GetSharedMemName is resolved.
+ */
+ snprintf(name, 64, "Global/PostgreSQL.%lu", (unsigned long) handle);
+
+ /*
+ * Handle teardown cases. Since Windows automatically destroys the object
+ * when no references remain, we can treat it the same as detach.
+ */
+ if (op == DSM_OP_DETACH || op == DSM_OP_DESTROY)
+ {
+ if (*mapped_address != NULL
+ && UnmapViewOfFile(*mapped_address) == 0)
+ {
+ _dosmaperr(GetLastError());
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not unmap shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ if (*impl_private != NULL
+ && CloseHandle(*impl_private) == 0)
+ {
+ _dosmaperr(GetLastError());
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not remove shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+
+ *impl_private = NULL;
+ *mapped_address = NULL;
+ *mapped_size = 0;
+ return true;
+ }
+
+ /* Create new segment or open an existing one for attach. */
+ if (op == DSM_OP_CREATE)
+ {
+ DWORD size_high = (DWORD) (request_size >> 32);
+ DWORD size_low = (DWORD) request_size;
+ hmap = CreateFileMapping(INVALID_HANDLE_VALUE, /* Use the pagefile */
+ NULL, /* Default security attrs */
+ PAGE_READWRITE, /* Memory is read/write */
+ size_high, /* Upper 32 bits of size */
+ size_low, /* Lower 32 bits of size */
+ name);
+ _dosmaperr(GetLastError());
+ if (errno == EEXIST)
+ {
+ /*
+ * On Windows, when the segment already exists, a handle for the
+ * existing segment is returned. We must close it before
+ * returning. We don't do _dosmaperr here, so errno won't be
+ * modified.
+ */
+ CloseHandle(hmap);
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not open shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ }
+ else
+ {
+ hmap = OpenFileMapping(FILE_MAP_WRITE | FILE_MAP_READ,
+ FALSE, /* do not inherit the name */
+ name); /* name of mapping object */
+ _dosmaperr(GetLastError());
+ }
+
+ if (!hmap)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not open shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+
+ /* Map it. */
+ address = MapViewOfFile(hmap, FILE_MAP_WRITE | FILE_MAP_READ,
+ 0, 0, 0);
+ if (!address)
+ {
+ int save_errno;
+
+ _dosmaperr(GetLastError());
+ /* Back out what's already been done. */
+ save_errno = errno;
+ CloseHandle(hmap);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not map shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+
+ /*
+ * VirtualQuery gives size in page_size units, which is 4K for Windows.
+ * We need size only when we are attaching, but it's better to get the
+ * size when creating new segment to keep size consistent both for
+ * DSM_OP_CREATE and DSM_OP_ATTACH.
+ */
+ if (VirtualQuery(address, &info, sizeof(info)) == 0)
+ {
+ int save_errno;
+
+ _dosmaperr(GetLastError());
+ /* Back out what's already been done. */
+ save_errno = errno;
+ UnmapViewOfFile(address);
+ CloseHandle(hmap);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not stat shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+
+ *mapped_address = address;
+ *mapped_size = info.RegionSize;
+ *impl_private = hmap;
+
+ return true;
+}
+#endif
+
+#ifdef USE_DSM_MMAP
+/*
+ * Operating system primitives to support mmap-based shared memory.
+ *
+ * Calling this "shared memory" is somewhat of a misnomer, because what
+ * we're really doing is creating a bunch of files and mapping them into
+ * our address space. The operating system may feel obliged to
+ * synchronize the contents to disk even if nothing is being paged out,
+ * which will not serve us well. The user can relocate the pg_dynshmem
+ * directory to a ramdisk to avoid this problem, if available.
+ */
+static bool
+dsm_impl_mmap(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address, uint64 *mapped_size,
+ int elevel)
+{
+ char name[64];
+ int flags;
+ int fd;
+ char *address;
+
+ snprintf(name, 64, PG_DYNSHMEM_DIR "/" PG_DYNSHMEM_MMAP_FILE_PREFIX "%lu",
+ (unsigned long) handle);
+
+ /* Handle teardown cases. */
+ if (op == DSM_OP_DETACH || op == DSM_OP_DESTROY)
+ {
+ if (*mapped_address != NULL
+ && munmap(*mapped_address, *mapped_size) != 0)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not unmap shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = NULL;
+ *mapped_size = 0;
+ if (op == DSM_OP_DESTROY && unlink(name) != 0)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not remove shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ return true;
+ }
+
+ /* Create new segment or open an existing one for attach or resize. */
+ flags = O_RDWR | (op == DSM_OP_CREATE ? O_CREAT | O_EXCL : 0);
+ if ((fd = OpenTransientFile(name, flags, 0600)) == -1)
+ {
+ if (errno != EEXIST)
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not open shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+
+ /*
+ * If we're attaching the segment, determine the current size; if we are
+ * creating or resizing the segment, set the size to the requested value.
+ */
+ if (op == DSM_OP_ATTACH)
+ {
+ struct stat st;
+
+ if (fstat(fd, &st) != 0)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ CloseTransientFile(fd);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not stat shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ request_size = st.st_size;
+ }
+ else if (*mapped_size > request_size && ftruncate(fd, request_size))
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ CloseTransientFile(fd);
+ if (op == DSM_OP_CREATE)
+ unlink(name);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not resize shared memory segment %s to " UINT64_FORMAT " bytes: %m",
+ name, request_size)));
+ return false;
+ }
+ else if (*mapped_size < request_size)
+ {
+ /*
+ * Allocate a buffer full of zeros.
+ *
+ * Note: palloc zbuffer, instead of just using a local char array,
+ * to ensure it is reasonably well-aligned; this may save a few
+ * cycles transferring data to the kernel.
+ */
+ char *zbuffer = (char *) palloc0(ZBUFFER_SIZE);
+ uint64 remaining = request_size;
+ bool success = true;
+
+ /*
+ * Zero-fill the file. We have to do this the hard way to ensure
+ * that all the file space has really been allocated, so that we
+ * don't later seg fault when accessing the memory mapping. This
+ * is pretty pessimal.
+ */
+ while (success && remaining > 0)
+ {
+ uint64 goal = remaining;
+
+ if (goal > ZBUFFER_SIZE)
+ goal = ZBUFFER_SIZE;
+ if (write(fd, zbuffer, goal) == goal)
+ remaining -= goal;
+ else
+ success = false;
+ }
+
+ if (!success)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ CloseTransientFile(fd);
+ if (op == DSM_OP_CREATE)
+ unlink(name);
+ errno = save_errno ? save_errno : ENOSPC;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not resize shared memory segment %s to " UINT64_FORMAT " bytes: %m",
+ name, request_size)));
+ return false;
+ }
+ }
+
+ /*
+ * If we're reattaching or resizing, we must remove any existing mapping,
+ * unless we've already got the right thing mapped.
+ */
+ if (*mapped_address != NULL)
+ {
+ if (*mapped_size == request_size)
+ return true;
+ if (munmap(*mapped_address, *mapped_size) != 0)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ CloseTransientFile(fd);
+ if (op == DSM_OP_CREATE)
+ unlink(name);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not unmap shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = NULL;
+ *mapped_size = 0;
+ }
+
+ /* Map it. */
+ address = mmap(NULL, request_size, PROT_READ|PROT_WRITE,
+ MAP_SHARED|MAP_HASSEMAPHORE, fd, 0);
+ if (address == MAP_FAILED)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ CloseTransientFile(fd);
+ if (op == DSM_OP_CREATE)
+ unlink(name);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not map shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = address;
+ *mapped_size = request_size;
+ CloseTransientFile(fd);
+
+ return true;
+}
+#endif
+
+static int
+errcode_for_dynamic_shared_memory(void)
+{
+ if (errno == EFBIG || errno == ENOMEM)
+ return errcode(ERRCODE_OUT_OF_MEMORY);
+ else
+ return errcode_for_file_access();
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a0b741b..040c7aa 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
+#include "storage/dsm.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
@@ -249,6 +250,10 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
ShmemBackendArrayAllocation();
#endif
+ /* Initialize dynamic shared memory facilities. */
+ if (!IsUnderPostmaster)
+ dsm_postmaster_startup();
+
/*
* Now give loadable modules a chance to set up their shmem allocations
*/
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 7d297bc..4dfb3cc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -61,6 +61,7 @@
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
+#include "storage/dsm_impl.h"
#include "storage/standby.h"
#include "storage/fd.h"
#include "storage/proc.h"
@@ -385,6 +386,7 @@ static const struct config_enum_entry synchronous_commit_options[] = {
*/
extern const struct config_enum_entry wal_level_options[];
extern const struct config_enum_entry sync_method_options[];
+extern const struct config_enum_entry dynamic_shared_memory_options[];
/*
* GUC option variables that are exported from this module
@@ -3324,6 +3326,16 @@ static struct config_enum ConfigureNamesEnum[] =
},
{
+ {"dynamic_shared_memory_type", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Selects the dynamic shared memory implementation used."),
+ NULL
+ },
+ &dynamic_shared_memory_type,
+ DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE, dynamic_shared_memory_options,
+ NULL, NULL, NULL
+ },
+
+ {
{"wal_sync_method", PGC_SIGHUP, WAL_SETTINGS,
gettext_noop("Selects the method used for forcing WAL updates to disk."),
NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d69a02b..c9cea28 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -123,6 +123,13 @@
#work_mem = 1MB # min 64kB
#maintenance_work_mem = 16MB # min 1MB
#max_stack_depth = 2MB # min 100kB
+#dynamic_shared_memory_type = posix # the default is the first option
+ # supported by the operating system:
+ # posix
+ # sysv
+ # windows
+ # mmap
+ # use none to disable dynamic shared memory
# - Disk -
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index e7ec393..43542cf 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -98,6 +98,11 @@ typedef struct ResourceOwnerData
int nfiles; /* number of owned temporary files */
File *files; /* dynamically allocated array */
int maxfiles; /* currently allocated array size */
+
+ /* We have built-in support for remembering dynamic shmem segments */
+ int ndsms; /* number of owned shmem segments */
+ dsm_segment **dsms; /* dynamically allocated array */
+ int maxdsms; /* currently allocated array size */
} ResourceOwnerData;
@@ -132,6 +137,7 @@ static void PrintPlanCacheLeakWarning(CachedPlan *plan);
static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
+static void PrintDSMLeakWarning(dsm_segment *seg);
/*****************************************************************************
@@ -271,6 +277,21 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
PrintRelCacheLeakWarning(owner->relrefs[owner->nrelrefs - 1]);
RelationClose(owner->relrefs[owner->nrelrefs - 1]);
}
+
+ /*
+ * Release dynamic shared memory segments. Note that dsm_detach()
+ * will remove the segment from my list, so I just have to iterate
+ * until there are none.
+ *
+ * As in the preceding cases, warn if any are left over at commit
+ * time.
+ */
+ while (owner->ndsms > 0)
+ {
+ if (isCommit)
+ PrintDSMLeakWarning(owner->dsms[owner->ndsms - 1]);
+ dsm_detach(owner->dsms[owner->ndsms - 1]);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -438,6 +459,8 @@ ResourceOwnerDelete(ResourceOwner owner)
pfree(owner->snapshots);
if (owner->files)
pfree(owner->files);
+ if (owner->dsms)
+ pfree(owner->dsms);
pfree(owner);
}
@@ -1230,3 +1253,88 @@ PrintFileLeakWarning(File file)
"temporary file leak: File %d still referenced",
file);
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * dynamic shmem segment reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeDSMs(ResourceOwner owner)
+{
+ int newmax;
+
+ if (owner->ndsms < owner->maxdsms)
+ return; /* nothing to do */
+
+ if (owner->dsms == NULL)
+ {
+ newmax = 16;
+ owner->dsms = (dsm_segment **)
+ MemoryContextAlloc(TopMemoryContext,
+ newmax * sizeof(dsm_segment *));
+ owner->maxdsms = newmax;
+ }
+ else
+ {
+ newmax = owner->maxdsms * 2;
+ owner->dsms = (dsm_segment **)
+ repalloc(owner->dsms, newmax * sizeof(dsm_segment *));
+ owner->maxdsms = newmax;
+ }
+}
+
+/*
+ * Remember that a dynamic shmem segment is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeDSMs()
+ */
+void
+ResourceOwnerRememberDSM(ResourceOwner owner, dsm_segment *seg)
+{
+ Assert(owner->ndsms < owner->maxdsms);
+ owner->dsms[owner->ndsms] = seg;
+ owner->ndsms++;
+}
+
+/*
+ * Forget that a dynamic shmem segment is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetDSM(ResourceOwner owner, dsm_segment *seg)
+{
+ dsm_segment **dsms = owner->dsms;
+ int ns1 = owner->ndsms - 1;
+ int i;
+
+ for (i = ns1; i >= 0; i--)
+ {
+ if (dsms[i] == seg)
+ {
+ while (i < ns1)
+ {
+ dsms[i] = dsms[i + 1];
+ i++;
+ }
+ owner->ndsms = ns1;
+ return;
+ }
+ }
+ elog(ERROR,
+ "dynamic shared memory segment %lu is not owned by resource owner %s",
+ (unsigned long) dsm_segment_handle(seg), owner->name);
+}
+
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintDSMLeakWarning(dsm_segment *seg)
+{
+ elog(WARNING,
+ "dynamic shared memory leak: segment %lu still referenced",
+ (unsigned long) dsm_segment_handle(seg));
+}
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index f66f530..a6eb0d8 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -182,6 +182,7 @@ const char *subdirs[] = {
"pg_xlog",
"pg_xlog/archive_status",
"pg_clog",
+ "pg_dynshmem",
"pg_notify",
"pg_serial",
"pg_snapshots",
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 8aabf3c..5eac52d 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -424,6 +424,9 @@
/* Define to 1 if you have the `setsid' function. */
#undef HAVE_SETSID
+/* Define to 1 if you have the `shm_open' function. */
+#undef HAVE_SHM_OPEN
+
/* Define to 1 if you have the `sigprocmask' function. */
#undef HAVE_SIGPROCMASK
diff --git a/src/include/portability/mem.h b/src/include/portability/mem.h
new file mode 100644
index 0000000..2a07c10
--- /dev/null
+++ b/src/include/portability/mem.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * mem.h
+ * portability definitions for various memory operations
+ *
+ * Copyright (c) 2001-2013, PostgreSQL Global Development Group
+ *
+ * src/include/portability/mem.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef MEM_H
+#define MEM_H
+
+#define IPCProtection (0600) /* access/modify by user only */
+
+#ifdef SHM_SHARE_MMU /* use intimate shared memory on Solaris */
+#define PG_SHMAT_FLAGS SHM_SHARE_MMU
+#else
+#define PG_SHMAT_FLAGS 0
+#endif
+
+/* Linux prefers MAP_ANONYMOUS, but the flag is called MAP_ANON on other systems. */
+#ifndef MAP_ANONYMOUS
+#define MAP_ANONYMOUS MAP_ANON
+#endif
+
+/* BSD-derived systems have MAP_HASSEMAPHORE, but it's not present (or needed) on Linux. */
+#ifndef MAP_HASSEMAPHORE
+#define MAP_HASSEMAPHORE 0
+#endif
+
+#define PG_MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+
+/* Some really old systems don't define MAP_FAILED. */
+#ifndef MAP_FAILED
+#define MAP_FAILED ((void *) -1)
+#endif
+
+#endif /* MEM_H */
diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h
new file mode 100644
index 0000000..2b5e722
--- /dev/null
+++ b/src/include/storage/dsm.h
@@ -0,0 +1,39 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsm.h
+ * manage dynamic shared memory segments
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/dsm.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DSM_H
+#define DSM_H
+
+#include "storage/dsm_impl.h"
+
+typedef struct dsm_segment dsm_segment;
+
+/* Initialization function. */
+extern void dsm_postmaster_startup(void);
+
+/* Functions that create, update, or remove mappings. */
+extern dsm_segment *dsm_create(uint64 size);
+extern dsm_segment *dsm_attach(dsm_handle h);
+extern void *dsm_resize(dsm_segment *seg, uint64 size);
+extern void *dsm_remap(dsm_segment *seg);
+extern void dsm_detach(dsm_segment *seg);
+
+/* Resource management functions. */
+extern void dsm_keep_mapping(dsm_segment *seg);
+extern dsm_segment *dsm_find_mapping(dsm_handle h);
+
+/* Informational functions. */
+extern void *dsm_segment_address(dsm_segment *seg);
+extern uint64 dsm_segment_map_length(dsm_segment *seg);
+extern dsm_handle dsm_segment_handle(dsm_segment *seg);
+
+#endif /* DSM_H */
diff --git a/src/include/storage/dsm_impl.h b/src/include/storage/dsm_impl.h
new file mode 100644
index 0000000..13f1f48
--- /dev/null
+++ b/src/include/storage/dsm_impl.h
@@ -0,0 +1,75 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsm_impl.h
+ * low-level dynamic shared memory primitives
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/dsm_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DSM_IMPL_H
+#define DSM_IMPL_H
+
+/* Dynamic shared memory implementations. */
+#define DSM_IMPL_NONE 0
+#define DSM_IMPL_POSIX 1
+#define DSM_IMPL_SYSV 2
+#define DSM_IMPL_WINDOWS 3
+#define DSM_IMPL_MMAP 4
+
+/*
+ * Determine which dynamic shared memory implementations will be supported
+ * on this platform, and which one will be the default.
+ */
+#ifdef WIN32
+#define USE_DSM_WINDOWS
+#define DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE DSM_IMPL_WINDOWS
+#else
+#ifdef HAVE_SHM_OPEN
+#define USE_DSM_POSIX
+#define DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE DSM_IMPL_POSIX
+#endif
+#define USE_DSM_SYSV
+#ifndef DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE
+#define DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE DSM_IMPL_SYSV
+#endif
+#define USE_DSM_MMAP
+#endif
+
+/* GUC. */
+extern int dynamic_shared_memory_type;
+
+/*
+ * Directory for on-disk state.
+ *
+ * This is used by all implementations for crash recovery and by the mmap
+ * implementation for storage.
+ */
+#define PG_DYNSHMEM_DIR "pg_dynshmem"
+#define PG_DYNSHMEM_MMAP_FILE_PREFIX "mmap."
+
+/* A "name" for a dynamic shared memory segment. */
+typedef uint32 dsm_handle;
+
+/* All the shared-memory operations we know about. */
+typedef enum
+{
+ DSM_OP_CREATE,
+ DSM_OP_ATTACH,
+ DSM_OP_DETACH,
+ DSM_OP_RESIZE,
+ DSM_OP_DESTROY
+} dsm_op;
+
+/* Create, attach to, detach from, resize, or destroy a segment. */
+extern bool dsm_impl_op(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address, uint64 *mapped_size,
+ int elevel);
+
+/* Some implementations cannot resize segments. Can this one? */
+extern bool dsm_impl_can_resize(void);
+
+#endif /* DSM_IMPL_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 39415a3..730c47b 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -80,6 +80,7 @@ typedef enum LWLockId
OldSerXidLock,
SyncRepLock,
BackgroundWorkerLock,
+ DynamicSharedMemoryControlLock,
/* Individual lock IDs end here */
FirstBufMappingLock,
FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a5d8707..6693483 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -16,6 +16,7 @@
#ifndef RESOWNER_PRIVATE_H
#define RESOWNER_PRIVATE_H
+#include "storage/dsm.h"
#include "storage/fd.h"
#include "storage/lock.h"
#include "utils/catcache.h"
@@ -80,4 +81,11 @@ extern void ResourceOwnerRememberFile(ResourceOwner owner,
extern void ResourceOwnerForgetFile(ResourceOwner owner,
File file);
+/* support for dynamic shared memory management */
+extern void ResourceOwnerEnlargeDSMs(ResourceOwner owner);
+extern void ResourceOwnerRememberDSM(ResourceOwner owner,
+ dsm_segment *);
+extern void ResourceOwnerForgetDSM(ResourceOwner owner,
+ dsm_segment *);
+
#endif /* RESOWNER_PRIVATE_H */
Hi Robert, Hi Amit,
Ok, first read through the patch.
On 2013-09-13 15:32:36 -0400, Robert Haas wrote:
-AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
+AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
Maybe also check for shm_unlink or is that too absurd?
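If we did want that, the configure.in change might look roughly like this (a hypothetical fragment, not part of the posted patch):

```
# Hypothetical: probe for shm_unlink alongside shm_open.
AC_CHECK_FUNCS([shm_open shm_unlink])
```

In practice shm_unlink is expected wherever shm_open exists, so a single probe may well suffice.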
--- /dev/null
+++ b/src/backend/storage/ipc/dsm.c
+#define PG_DYNSHMEM_STATE_FILE PG_DYNSHMEM_DIR "/state"
+#define PG_DYNSHMEM_NEW_STATE_FILE PG_DYNSHMEM_DIR "/state.new"
Hm, I guess you don't want to add it to global/ or so because of the
mmap implementation where you presumably scan the directory?
+struct dsm_segment
+{
+	dlist_node node;		/* List link in dsm_segment_list. */
+	ResourceOwner resowner;	/* Resource owner. */
+	dsm_handle handle;		/* Segment name. */
+	uint32 control_slot;	/* Slot in control segment. */
+	void *impl_private;		/* Implementation-specific private data. */
+	void *mapped_address;	/* Mapping address, or NULL if unmapped. */
+	uint64 mapped_size;		/* Size of our mapping. */
+};
Document that's backend local?
+typedef struct dsm_control_item
+{
+	dsm_handle handle;
+	uint32 refcnt;			/* 2+ = active, 1 = moribund, 0 = gone */
+} dsm_control_item;
+
+typedef struct dsm_control_header
+{
+	uint32 magic;
+	uint32 nitems;
+	uint32 maxitems;
+	dsm_control_item item[FLEXIBLE_ARRAY_MEMBER];
+} dsm_control_header;
And those are shared memory?
+static void dsm_cleanup_using_control_segment(void);
+static void dsm_cleanup_for_mmap(void);
+static bool dsm_read_state_file(dsm_handle *h);
+static void dsm_write_state_file(dsm_handle h);
+static void dsm_postmaster_shutdown(int code, Datum arg);
+static void dsm_backend_shutdown(int code, Datum arg);
+static dsm_segment *dsm_create_descriptor(void);
+static bool dsm_control_segment_sane(dsm_control_header *control,
+				   uint64 mapped_size);
+static uint64 dsm_control_bytes_needed(uint32 nitems);
+
+/* Has this backend initialized the dynamic shared memory system yet? */
+static bool dsm_init_done = false;
+
+/*
+ * List of dynamic shared memory segments used by this backend.
+ *
+ * At process exit time, we must decrement the reference count of each
+ * segment we have attached; this list makes it possible to find all such
+ * segments.
+ *
+ * This list should always be empty in the postmaster.  We could probably
+ * allow the postmaster to map dynamic shared memory segments before it
+ * begins to start child processes, provided that each process adjusted
+ * the reference counts for those segments in the control segment at
+ * startup time, but there's no obvious need for such a facility, which
+ * would also be complex to handle in the EXEC_BACKEND case.  Once the
+ * postmaster has begun spawning children, there's an additional problem:
+ * each new mapping would require an update to the control segment,
+ * which requires locking, in which the postmaster must not be involved.
+ */
+static dlist_head dsm_segment_list = DLIST_STATIC_INIT(dsm_segment_list);
+
+/*
+ * Control segment information.
+ *
+ * Unlike ordinary shared memory segments, the control segment is not
+ * reference counted; instead, it lasts for the postmaster's entire
+ * life cycle.  For simplicity, it doesn't have a dsm_segment object either.
+ */
+static dsm_handle dsm_control_handle;
+static dsm_control_header *dsm_control;
+static uint64 dsm_control_mapped_size = 0;
+static void *dsm_control_impl_private = NULL;
+
+/*
+ * Start up the dynamic shared memory system.
+ *
+ * This is called just once during each cluster lifetime, at postmaster
+ * startup time.
+ */
+void
+dsm_postmaster_startup(void)
+{
+	void *dsm_control_address = NULL;
+	uint32 maxitems;
+	uint64 segsize;
+
+	Assert(!IsUnderPostmaster);
+
+	/* If dynamic shared memory is disabled, there's nothing to do. */
+	if (dynamic_shared_memory_type == DSM_IMPL_NONE)
+		return;
+
+	/*
+	 * Check for, and remove, shared memory segments left behind by a dead
+	 * postmaster.  This isn't necessary on Windows, which always removes them
+	 * when the last reference is gone.
+	 */
+	switch (dynamic_shared_memory_type)
+	{
+		case DSM_IMPL_POSIX:
+		case DSM_IMPL_SYSV:
+			dsm_cleanup_using_control_segment();
+			break;
+		case DSM_IMPL_MMAP:
+			dsm_cleanup_for_mmap();
+			break;
+		case DSM_IMPL_WINDOWS:
+			/* Nothing to do. */
+			break;
+		default:
+			elog(ERROR, "unknown dynamic shared memory type: %d",
+				 dynamic_shared_memory_type);
+	}
+
+	/* Determine size for new control segment. */
+	maxitems = PG_DYNSHMEM_FIXED_SLOTS
+		+ PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;
It seems likely that MaxConnections would be sufficient?
+	elog(DEBUG2, "dynamic shared memory system will support %lu segments",
+		 (unsigned long) maxitems);
+	segsize = dsm_control_bytes_needed(maxitems);
+
+	/* Create new control segment. */
+	for (;;)
+	{
+		Assert(dsm_control_address == NULL);
+		Assert(dsm_control_mapped_size == 0);
+		dsm_control_handle = random();
+		if (dsm_impl_op(DSM_OP_CREATE, dsm_control_handle, segsize,
+						&dsm_control_impl_private, &dsm_control_address,
+						&dsm_control_mapped_size, ERROR))
+			break;
+	}
Comment that we loop endlessly to find an unused identifier.
Why do we create the control segment in dynamic smem and not in the
normal shmem? Presumably because this way it has the same lifetime? If
so, that should be commented upon.
+static void
+dsm_cleanup_using_control_segment(void)
+{
+	/*
+	 * We've managed to reattach it, but the contents might not be sane.
+	 * If they aren't, we disregard the segment after all.
+	 */
+	old_control = (dsm_control_header *) mapped_address;
+	if (!dsm_control_segment_sane(old_control, mapped_size))
+	{
+		dsm_impl_op(DSM_OP_DETACH, old_control_handle, 0, &impl_private,
+					&mapped_address, &mapped_size, LOG);
+		return;
+	}
So, we leave it just hanging around... Well, it has precedent in our
normal shared memory handling.
+static void
+dsm_cleanup_for_mmap(void)
+{
...
+}
I still maintain that the extra infrastructure required isn't worth the
gain of having the mmap implementation.
+/*
+ * Read and parse the state file.
+ *
+ * If the state file is empty or the contents are garbled, it probably means
+ * that the operating system rebooted before the data written by the previous
+ * postmaster made it to disk.  In that case, we can just ignore it; any shared
+ * memory from before the reboot should be gone anyway.
+ */
+static bool
+dsm_read_state_file(dsm_handle *h)
+{
...
+ return true;
+}
Perhaps CRC32 the content?
+/*
+ * Write our control segment handle to the state file, so that if the
+ * postmaster is killed without running it's on_shmem_exit hooks, the
+ * next postmaster can clean things up after restart.
+ */
+static void
+dsm_write_state_file(dsm_handle h)
+{
+	int			statefd;
+	char		statebuf[PG_DYNSHMEM_STATE_BUFSIZ];
+	int			nbytes;
+
+	/* Create or truncate the file. */
+	statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);
Doesn't this need a | PG_BINARY? Why are you using open() and not
BasicOpenFile or even OpenTransientFile?
+	/* Write contents. */
+	snprintf(statebuf, PG_DYNSHMEM_STATE_BUFSIZ, "%lu\n",
+			 (unsigned long) dsm_control_handle);
Why are we upcasting the length of dsm_control_handle here? Also,
doesn't this need the usual UINT64_FORMAT thingy?
+/*
+ * At shutdown time, we iterate over the control segment and remove all
+ * remaining dynamic shared memory segments.  We avoid throwing errors here;
+ * the postmaster is shutting down either way, and this is just non-critical
+ * resource cleanup.
+ */
+static void
+dsm_postmaster_shutdown(int code, Datum arg)
+{
+	uint32		nitems;
+	uint32		i;
+	void	   *dsm_control_address;
+	void	   *junk_mapped_address = NULL;
+	void	   *junk_impl_private = NULL;
+	uint64		junk_mapped_size = 0;
+
+	/* If dynamic shared memory is disabled, there's nothing to do. */
+	if (dynamic_shared_memory_type == DSM_IMPL_NONE)
+		return;
But we don't even get called in that case, right? You're only registering
the on_shmem_exit() handler after the same check in startup?
+	/*
+	 * If some other backend exited uncleanly, it might have corrupted the
+	 * control segment while it was dying.  In that case, we warn and ignore
+	 * the contents of the control segment.  This may end up leaving behind
+	 * stray shared memory segments, but there's not much we can do about
+	 * that if the metadata is gone.
+	 */
+	nitems = dsm_control->nitems;
+	if (!dsm_control_segment_sane(dsm_control, dsm_control_mapped_size))
+	{
+		ereport(LOG,
+				(errmsg("dynamic shared memory control segment is corrupt")));
+		return;
+	}
I'd rename dsm_control_segment_sane to dsm_control_segment_looks_sane ;)
+	/* Remove any remaining segments. */
+	for (i = 0; i < nitems; ++i)
+	{
+		dsm_handle	handle;
+
+		/* If the reference count is 0, the slot is actually unused. */
+		if (dsm_control->item[i].refcnt == 0)
+			continue;
+
+		/* Log debugging information. */
+		handle = dsm_control->item[i].handle;
+		elog(DEBUG2, "cleaning up orphaned dynamic shared memory with ID %lu",
+			 (unsigned long) handle);
I'd include the refcount here, might be helpful for debugging.
+		/* Destroy the segment. */
+		dsm_impl_op(DSM_OP_DESTROY, handle, 0, &junk_impl_private,
+					&junk_mapped_address, &junk_mapped_size, LOG);
+	}
+
+	/* Remove the control segment itself. */
+	elog(DEBUG2,
+		 "cleaning up dynamic shared memory control segment with ID %lu",
+		 (unsigned long) dsm_control_handle);
+	dsm_control_address = dsm_control;
+	dsm_impl_op(DSM_OP_DESTROY, dsm_control_handle, 0,
+				&dsm_control_impl_private, &dsm_control_address,
+				&dsm_control_mapped_size, LOG);
+	dsm_control = dsm_control_address;
+
+	/* And, finally, remove the state file. */
+	if (unlink(PG_DYNSHMEM_STATE_FILE) < 0)
+		ereport(LOG,
+				(errcode_for_file_access(),
+				 errmsg("could not unlink file \"%s\": %m",
+						PG_DYNSHMEM_STATE_FILE)));
+}
Not sure whether it's sensible to only LOG in these cases. After all
there's something unexpected happening. The robustness argument doesn't
count since we're already shutting down.
+/*
+ * Prepare this backend for dynamic shared memory usage.  Under EXEC_BACKEND,
+ * we must reread the state file and map the control segment; in other cases,
+ * we'll have inherited the postmaster's mapping and global variables.
+ */
+static void
+dsm_backend_startup(void)
+{
+
+#ifdef EXEC_BACKEND
+	{
+		dsm_handle	control_handle;
+		void	   *control_address = NULL;
+
+		/* Read the control segment information from the state file. */
+		if (!dsm_read_state_file(&control_handle))
+			ereport(ERROR,
+					(errcode(ERRCODE_INTERNAL_ERROR),
+					 errmsg("could not parse dynamic shared memory state file")));
+
+		/* Attach control segment. */
+		dsm_impl_op(DSM_OP_ATTACH, control_handle, 0,
+					&dsm_control_impl_private, &control_address,
+					&dsm_control_mapped_size, ERROR);
+		dsm_control_handle = control_handle;
+		dsm_control = control_address;
+		/* If control segment doesn't look sane, something is badly wrong. */
+		if (!dsm_control_segment_sane(dsm_control, dsm_control_mapped_size))
+		{
+			dsm_impl_op(DSM_OP_DETACH, control_handle, 0,
+						&dsm_control_impl_private, &control_address,
+						&dsm_control_mapped_size, WARNING);
+			ereport(ERROR,
+					(errcode(ERRCODE_INTERNAL_ERROR),
+					 errmsg("dynamic shared memory control segment is not valid")));
+		}
Imo that's a PANIC or at the very least a FATAL.
+
+/*
+ * Create a new dynamic shared memory segment.
+ */
+dsm_segment *
+dsm_create(uint64 size)
+{
...
+	/* Verify that we can support an additional mapping. */
+	if (nitems >= dsm_control->maxitems)
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+				 errmsg("too many dynamic shared memory segments")));
Do we rely on being run in an environment with proper setup for lwlock
cleanup? I can imagine shared libraries doing this pretty early on...
+dsm_segment *
+dsm_attach(dsm_handle h)
+{
+	/* Bump reference count for this segment in shared memory. */
+	LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
+	nitems = dsm_control->nitems;
+	for (i = 0; i < nitems; ++i)
+	{
+		/* If the reference count is 0, the slot is actually unused. */
+		if (dsm_control->item[i].refcnt == 0)
+			continue;
+
+		/*
+		 * If the reference count is 1, the slot is still in use, but the
+		 * segment is in the process of going away.  Treat that as if we
+		 * didn't find a match.
+		 */
+		if (dsm_control->item[i].refcnt == 1)
+			break;
Why can we see that state? Shouldn't locking prevent that?
+/*
+ * Resize an existing shared memory segment.
+ *
+ * This may cause the shared memory segment to be remapped at a different
+ * address.  For the caller's convenience, we return the mapped address.
+ */
+void *
+dsm_resize(dsm_segment *seg, uint64 size)
+{
+	Assert(seg->control_slot != INVALID_CONTROL_SLOT);
+	dsm_impl_op(DSM_OP_RESIZE, seg->handle, size, &seg->impl_private,
+				&seg->mapped_address, &seg->mapped_size, ERROR);
+	return seg->mapped_address;
+}
Hm. That's valid when there are other backends attached? What are the
implications for already attached ones?
Shouldn't we error out if !dsm_impl_can_resize()?
+/*
+ * Detach from a shared memory segment, destroying the segment if we
+ * remove the last reference.
+ *
+ * This function should never fail.  It will often be invoked when aborting
+ * a transaction, and a further error won't serve any purpose.  It's not a
+ * complete disaster if we fail to unmap or destroy the segment; it means a
+ * resource leak, but that doesn't necessarily preclude further operations.
+ */
+void
+dsm_detach(dsm_segment *seg)
+{
Why do we want to ignore errors like failing to unmap? ISTM that
indicates an actual problem...
+	/* Reduce reference count, if we previously increased it. */
+	if (seg->control_slot != INVALID_CONTROL_SLOT)
+	{
+		uint32		refcnt;
+		uint32		control_slot = seg->control_slot;
+
+		LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
+		Assert(dsm_control->item[control_slot].handle == seg->handle);
+		Assert(dsm_control->item[control_slot].refcnt > 1);
+		refcnt = --dsm_control->item[control_slot].refcnt;
+		seg->control_slot = INVALID_CONTROL_SLOT;
+		LWLockRelease(DynamicSharedMemoryControlLock);
+
+		/* If new reference count is 1, try to destroy the segment. */
+		if (refcnt == 1)
+		{
+			/*
+			 * If we fail to destroy the segment here, or are killed before
+			 * we finish doing so, the reference count will remain at 1, which
+			 * will mean that nobody else can attach to the segment.  At
+			 * postmaster shutdown time, or when a new postmaster is started
+			 * after a hard kill, another attempt will be made to remove the
+			 * segment.
+			 *
+			 * The main case we're worried about here is being killed by
+			 * a signal before we can finish removing the segment.  In that
+			 * case, it's important to be sure that the segment still gets
+			 * removed.  If we actually fail to remove the segment for some
+			 * other reason, the postmaster may not have any better luck than
+			 * we did.  There's not much we can do about that, though.
+			 */
+			if (dsm_impl_op(DSM_OP_DESTROY, seg->handle, 0, &seg->impl_private,
+							&seg->mapped_address, &seg->mapped_size, WARNING))
+			{
+				LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
+				Assert(dsm_control->item[control_slot].handle == seg->handle);
+				Assert(dsm_control->item[control_slot].refcnt == 1);
+				dsm_control->item[control_slot].refcnt = 0;
+				LWLockRelease(DynamicSharedMemoryControlLock);
+			}
+		}
+	}
Yuck. So that's the answer to my earlier question about the legality of
seeing a refcount of 1....
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
new file mode 100644
index 0000000..8005fd9
@@ -0,0 +1,986 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsm_impl.c
+ *	  manage dynamic shared memory segments
+ *
+ * This file provides low-level APIs for creating and destroying shared
+ * memory segments using several different possible techniques.  We refer
+ * to these segments as dynamic because they can be created, altered, and
+ * destroyed at any point during the server life cycle.  This is unlike
+ * the main shared memory segment, of which there is always exactly one
+ * and which is always mapped at a fixed address in every PostgreSQL
+ * background process.
+ *
+ * Because not all systems provide the same primitives in this area, nor
+ * do all primitives behave the saem way on all systems, we provide
*same
+ * several implementations of this facility.  Many systems implement
+ * POSIX shared memory (shm_open etc.), which is well-suited to our needs
+ * in this area, with the exception that shared memory identifiers live
+ * in a flat system-wide namespace, raising the uncomfortable prospect of
+ * name collisions with other processes (including other copies of
+ * PostgreSQL) running on the same system.
Why isn't the port number part of the posix shmem identifiers? Sure, we
retry, but using a logic similar to sysv_shmem.c seems like a good idea.
+/*
+ * Does the current dynamic shared memory implementation support resizing
+ * segments?  (The answer here could be platform-dependent in the future,
+ * since AIX allows shmctl(shmid, SHM_RESIZE, &buffer), though you apparently
+ * can't resize segments to anything larger than 256MB that way.  For now,
+ * we keep it simple.)
+ */
+bool
+dsm_impl_can_resize(void)
+{
+	switch (dynamic_shared_memory_type)
+	{
+		case DSM_IMPL_NONE:
+			return false;
+		case DSM_IMPL_SYSV:
+			return false;
+		case DSM_IMPL_WINDOWS:
+			return false;
+		default:
+			return true;
+	}
+}
Looks to me like the logic should be the reverse.
+#ifdef USE_DSM_POSIX
+/*
+ * Operating system primitives to support POSIX shared memory.
+ *
+ * POSIX shared memory segments are created and attached using shm_open()
+ * and shm_unlink(); other operations, such as sizing or mapping the
+ * segment, are performed as if the shared memory segments were files.
+ *
+ * Indeed, on some platforms, they may be implemented that way.  While
+ * POSIX shared memory segments seem intended to exist in a flat namespace,
+ * some operating systems may implement them as files, even going so far
+ * to treat a request for /xyz as a request to create a file by that name
+ * in the root directory.  Users of such broken platforms should select
+ * a different shared memory implementation.
+ */
+static bool
+dsm_impl_posix(dsm_op op, dsm_handle handle, uint64 request_size,
+			   void **impl_private, void **mapped_address, uint64 *mapped_size,
+			   int elevel)
+{
+	char		name[64];
+	int			flags;
+	int			fd;
+	char	   *address;
+
+	snprintf(name, 64, "/PostgreSQL.%lu", (unsigned long) handle);
Why wider than the handle?
+	/*
+	 * If we're reattaching or resizing, we must remove any existing mapping,
+	 * unless we've already got the right thing mapped.
+	 */
+	if (*mapped_address != NULL)
+	{
+		if (*mapped_size == request_size)
+			return true;
Hm. It could have gotten resized to the old size, or resized twice. In
that case it might not be at the same address as before, so checking
the size doesn't seem sufficient.
+static bool
+dsm_impl_sysv(dsm_op op, dsm_handle handle, uint64 request_size,
+			  void **impl_private, void **mapped_address, uint64 *mapped_size,
+			  int elevel)
+{
+	/*
+	 * There's one special key, IPC_PRIVATE, which can't be used.  If we end
+	 * up with that value by chance during a create operation, just pretend
+	 * it already exists, so that caller will retry.  If we run into it
+	 * anywhere else, the caller has passed a handle that doesn't correspond
+	 * to anything we ever created, which should not happen.
+	 */
+	if (key == IPC_PRIVATE)
+	{
+		if (op != DSM_OP_CREATE)
+			elog(elevel, "System V shared memory key may not be IPC_PRIVATE");
+		errno = EEXIST;
+		return false;
+	}
Hm. You're elog(elevel) here, but the retry code in dsm_create() passes
in ERROR?
+static bool
+dsm_impl_mmap(dsm_op op, dsm_handle handle, uint64 request_size,
+			  void **impl_private, void **mapped_address, uint64 *mapped_size,
+			  int elevel)
+{
+	/*
+	 * If we're reattaching or resizing, we must remove any existing mapping,
+	 * unless we've already got the right thing mapped.
+	 */
+	if (*mapped_address != NULL)
+	{
+		if (*mapped_size == request_size)
+			return true;
Same thing as in posix shmem.
+static int
+errcode_for_dynamic_shared_memory()
+{
+	if (errno == EFBIG || errno == ENOMEM)
+		return errcode(ERRCODE_OUT_OF_MEMORY);
+	else
+		return errcode_for_file_access();
+}
Is EFBIG guaranteed to be defined?
+/*
+ * Forget that a temporary file is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetDSM(ResourceOwner owner, dsm_segment *seg)
+{
+	dsm_segment **dsms = owner->dsms;
+	int			ns1 = owner->ndsms - 1;
+	int			i;
+
+	for (i = ns1; i >= 0; i--)
+	{
+		if (dsms[i] == seg)
+		{
+			while (i < ns1)
+			{
+				dsms[i] = dsms[i + 1];
+				i++;
+			}
+			owner->ndsms = ns1;
+			return;
+		}
+	}
+	elog(ERROR,
+		 "dynamic shared memory segment %lu is not owned by resource owner %s",
+		 (unsigned long) dsm_segment_handle(seg), owner->name);
+}
Not really an issue, but this will grow owner->dsm unnecessarily because
ResourceOwnerEnlargeDSMs() will have been done previously.
Not sure yet how happy I am with the separation of concerns between
dsm.c and dsm_impl.c...
That's it for now.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Sep 18, 2013 at 1:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Maybe also check for shm_unlink or is that too absurd?
Well, if we find that there are systems where shm_open is present
and shm_unlink is not present, then we'll need such a test. But I
hesitate to decide what the right thing to do on such systems without
knowing more. And the time it takes to run configure is a very
substantial percentage of the complete build time, at least on my box,
so I don't really want to add tests just because they might be needed
somewhere.
--- /dev/null
+++ b/src/backend/storage/ipc/dsm.c
+#define PG_DYNSHMEM_STATE_FILE			PG_DYNSHMEM_DIR "/state"
+#define PG_DYNSHMEM_NEW_STATE_FILE		PG_DYNSHMEM_DIR "/state.new"

Hm, I guess you don't want to add it to global/ or so because of the
mmap implementation where you presumably scan the directory?
Yes, and also because I thought this way would make it easier to teach
things like pg_basebackup (or anybody's home-brew scripts) to just
skip that directory completely. Actually, I was wondering if we ought
to have a directory under pgdata whose explicit charter it was to
contain files that shouldn't be copied as part of a base backup.
pg_do_not_back_this_up.
Document that's backend local?
And those are shared memory?
Sure, done.
+	/* Determine size for new control segment. */
+	maxitems = PG_DYNSHMEM_FIXED_SLOTS
+		+ PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;

It seems likely that MaxConnections would be sufficient?
I think we could argue about the best way to set this until the cows
come home, but I don't think it probably matters much at this point.
We can always change the formula later as we gain experience.
However, I don't have a principled reason for assuming that only
user-connected backends will create dynamic shared memory segments.
Comment that we loop endlessly to find a unused identifier.
Done.
Why do we create the control segment in dynamic smem and not in the
normal shmem? Presumably because this way it has the same lifetime? If
so, that should be commented upon.
If it were part of the normal shared memory segment, I don't think
there'd be any good way to implement
dsm_cleanup_using_control_segment().
So, we leave it just hanging around... Well, it has precedent in our
normal shared memory handling.
And, it could belong to some unrelated process.
I still maintain that the extra infrastructure required isn't worth the
gain of having the mmap implementation.
I know.
+/*
+ * Read and parse the state file.
+ *

Perhaps CRC32 the content?
I don't see the point. If the file contents are garbage that happens
to look like a number, we'll go "oh, there isn't any such segment" or
"oh, there is such a segment but it doesn't look like a control
segment, so forget it". There are a lot of things we really ought to
be CRCing to avoid corruption risk, but I can't see how this is
remotely one of them.
+	/* Create or truncate the file. */
+	statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);

Doesn't this need a | PG_BINARY?
It's a text file. Do we need PG_BINARY anyway?
Why are you using open() and not
BasicOpenFile or even OpenTransientFile?
Because those don't work in the postmaster.
+	/* Write contents. */
+	snprintf(statebuf, PG_DYNSHMEM_STATE_BUFSIZ, "%lu\n",
+			 (unsigned long) dsm_control_handle);

Why are we upcasting the length of dsm_control_handle here? Also,
doesn't this need the usual UINT64_FORMAT thingy?
dsm_handle is an alias for uint32. Is that always exactly an unsigned
int or can it sometimes be an unsigned long? I thought the latter, so
couldn't figure out how to write this portably without casting to a
type that explicitly matched the format string.
But we don't even get called in that case, right? You're only registering
the on_shmem_exit() handler after the same check in startup?
True. Removed.
I'd rename dsm_control_segment_sane to dsm_control_segment_looks_sane ;)
Meh. I write too many really long function names as it is.
I'd include the refcount here, might be helpful for debugging.
OK, done.
Not sure whether it's sensible to only LOG in these cases. After all
there's something unexpected happening. The robustness argument doesn't
count since we're already shutting down.
I see no point in throwing an error. The fact that we're having
trouble cleaning up one dynamic shared memory segment doesn't mean we
shouldn't try to clean up others, or that any remaining postmaster
shutdown hooks shouldn't be executed.
+			ereport(ERROR,
+					(errcode(ERRCODE_INTERNAL_ERROR),
+					 errmsg("dynamic shared memory control segment is not valid")));

Imo that's a PANIC or at the very least a FATAL.
Sure, that's a tempting option, but it doesn't seem to serve any very
necessary point. There's no data corruption problem if we proceed
here. Most likely either (1) there's a bug in the code, which
panicking won't fix or (2) the DBA hand-edited the state file, in
which case maybe he shouldn't have done that, but if he thinks the
best way to recover from that is a cluster-wide restart, he can do
that himself.
+dsm_segment *
+dsm_create(uint64 size)

Do we rely on being run in an environment with proper setup for lwlock
cleanup? I can imagine shared libraries doing this pretty early on...
Yes, we rely on that. I don't really see that as a problem. You'd
better connect to the main shared memory segment before starting to
create your own.
Why can we see that state? Shouldn't locking prevent that?
We initialize the refcnt to 2 on creation, so that it ends up as 1
when the last reference is gone. This prevents race conditions:
imagine that one process creates a DSM and then passes the handle to
some other process. But, before that second process gets around to
mapping it, the first process dies. At that point, there are no
remaining references to the dynamic shared memory segment, so it goes
away, as per spec. But what if we're in the middle of destroying when
the second process finally gets around to trying to map it? In such a
situation, the behavior would be implementation-specific. I felt that
it was better not to find out whether all operating systems and shared
memory types handle such cases similarly, because I'm pretty sure they
don't. The two-stage destruction process means that once we've
committed to destroying the segment, no one else will attempt to map
it.
+void *
+dsm_resize(dsm_segment *seg, uint64 size)
+{
+	Assert(seg->control_slot != INVALID_CONTROL_SLOT);
+	dsm_impl_op(DSM_OP_RESIZE, seg->handle, size, &seg->impl_private,
+				&seg->mapped_address, &seg->mapped_size, ERROR);
+	return seg->mapped_address;
+}

Hm. That's valid when there are other backends attached? What are the
implications for already attached ones?
They'll continue to see the portion they have mapped, but must do
dsm_remap() if they want to see the whole thing.
Shouldn't we error out if !dsm_impl_can_resize()?
The implementation-specific code throws an error if it can't support
resize. Even if we put a secondary check here, I wouldn't want
dsm_impl_op to behave in an undefined manner when asked to resize
under an implementation that can't. And there doesn't seem to be much
point in having two checks.
+void
+dsm_detach(dsm_segment *seg)
+{

Why do we want to ignore errors like failing to unmap? ISTM that
indicates an actual problem...
Sure it does. But what are you going to do about it? In many cases,
you're going to get here during a transaction abort caused by some
other error. If the transaction is already aborting, throwing an
error here will just cause the original error to get discarded in
favor of showing this one, or maybe it's the other way around. I
don't remember, but it's definitely one or the other, and neither is
desirable. Throwing a warning, on the other hand, will notify the
user, which is what we want.
Now on the flip side we might not be aborting; maybe we're committing.
But we don't want to turn a commit into an abort just for this. If
resowner.c detects a buffer pin leak or a tuple descriptor leak, those
are "just" warning as well. They're serious warnings, of course, and
if they happen it means there's a bug in the code that needs to be
fixed. But the severity of an ereport() isn't based just on how
alarming the situation is; it's based on what you want to happen when
that situation comes up. And we've decided (correctly, I think) that
resource leaks are not grounds for aborting a transaction that
otherwise would have committed.
Yuck. So that's the answer to my earlier question about the legality of
seing a refcount of 1....
Read it and weep.
*same
Fixed, thanks.
+ * several implementations of this facility.  Many systems implement
+ * POSIX shared memory (shm_open etc.), which is well-suited to our needs
+ * in this area, with the exception that shared memory identifiers live
+ * in a flat system-wide namespace, raising the uncomfortable prospect of
+ * name collisions with other processes (including other copies of
+ * PostgreSQL) running on the same system.

Why isn't the port number part of the posix shmem identifiers? Sure, we
retry, but using a logic similar to sysv_shmem.c seems like a good idea.
According to the man page for shm_open on Solaris, "For maximum
portability, name should include no more than 14 characters, but this
limit is not enforced."
http://www.unix.com/man-page/OpenSolaris/3c/shm_open/
I'm unclear whether there are any real systems that have a problem with this.
+bool
+dsm_impl_can_resize(void)
+{
+	switch (dynamic_shared_memory_type)
+	{
+		case DSM_IMPL_NONE:
+			return false;
+		case DSM_IMPL_SYSV:
+			return false;
+		case DSM_IMPL_WINDOWS:
+			return false;
+		default:
+			return true;
+	}
+}

Looks to me like the logic should be the reverse.
I've changed it to list all the cases explicitly.
+	char		name[64];
+	int			flags;
+	int			fd;
+	char	   *address;
+
+	snprintf(name, 64, "/PostgreSQL.%lu", (unsigned long) handle);

Why wider than the handle?
Same as above - not sure that uint32 == unsigned int everywhere.
+	/*
+	 * If we're reattaching or resizing, we must remove any existing mapping,
+	 * unless we've already got the right thing mapped.
+	 */
+	if (*mapped_address != NULL)
+	{
+		if (*mapped_size == request_size)
+			return true;

Hm. It could have gotten resized to the old size, or resized twice. In
that case it might not be at the same address as before, so checking
the size doesn't seem sufficient.
I don't understand your concern. If someone resizes the DSM to its
already-current size, there is no need to remap it. The old mapping
is just fine. And if some other backend resizes the DSM to a larger
size and then back to the original size, and then we're asked to
update the mapping, there is no need to change anything.
Hm. You're elog(elevel) here, but the retry code in dsm_create() passes
in ERROR?
Oh, that's bad. Fixed.
+static int
+errcode_for_dynamic_shared_memory()
+{
+	if (errno == EFBIG || errno == ENOMEM)
+		return errcode(ERRCODE_OUT_OF_MEMORY);
+	else
+		return errcode_for_file_access();
+}

Is EFBIG guaranteed to be defined?
I dunno. We could put an #ifdef around that part. Should we do that
now or wait and see if it actually breaks anywhere?
Not really an issue, but this will grow owner->dsm unnecessarily because
ResourceOwnerEnlargeDSMs() will have been done previously.
I tried to copy the existing uses of resowner.c as closely as
possible; if you think that there's something I should be doing to
mimic it more closely, let me know.
Not sure yet how happy I am with the separation of concerns between
dsm.c and dsm_impl.c...
I was hoping for a cleaner abstraction break, but I couldn't make it
work out any better than this. Even so, I think it's worthwhile
having two files; an imperfect separation of concerns still seems
better than concatenating them into one really long file. (FWIW, the
combined file would be longer than 84% of the 622 .c files in the
backend; as a fan of keeping .c files relatively small, I'm not eager
to be the cause of us having more large ones.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
dynshmem-v3.patch (application/octet-stream)
diff --git a/configure b/configure
index c685ca3..97d2f68 100755
--- a/configure
+++ b/configure
@@ -8384,6 +8384,180 @@ if test "$ac_res" != no; then
fi
+{ $as_echo "$as_me:$LINENO: checking for library containing shm_open" >&5
+$as_echo_n "checking for library containing shm_open... " >&6; }
+if test "${ac_cv_search_shm_open+set}" = set; then
+ $as_echo_n "(cached) " >&6
+else
+ ac_func_search_save_LIBS=$LIBS
+cat >conftest.$ac_ext <<_ACEOF
+/* confdefs.h. */
+_ACEOF
+cat confdefs.h >>conftest.$ac_ext
+cat >>conftest.$ac_ext <<_ACEOF
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char shm_open ();
+int
+main ()
+{
+return shm_open ();
+ ;
+ return 0;
+}
+_ACEOF
+for ac_lib in '' rt; do
+ if test -z "$ac_lib"; then
+ ac_res="none required"
+ else
+ ac_res=-l$ac_lib
+ LIBS="-l$ac_lib $ac_func_search_save_LIBS"
+ fi
+ rm -f conftest.$ac_objext conftest$ac_exeext
+if { (ac_try="$ac_link"
+case "(($ac_try" in
+ *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
+ *) ac_try_echo=$ac_try;;
+esac
+eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\""
+$as_echo "$ac_try_echo") >&5
+ (eval "$ac_link") 2>conftest.er1
+ ac_status=$?
+ grep -v '^ *+' conftest.er1 >conftest.err
+ rm -f conftest.er1
+ cat conftest.err >&5
+ $as_echo "$as_me:$LINENO: \$? = $ac_status" >&5
+ (exit $ac_status); } && {
+ test -z "$ac_c_werror_flag" ||
+ test ! -s conftest.err
+ } && test -s conftest$ac_exeext && {
+ test "$cross_compiling" = yes ||
+ $as_test_x conftest$ac_exeext
+ }; then
+ ac_cv_search_shm_open=$ac_res
+else
+ $as_echo "$as_me: failed program was:" >&5
+sed 's/^/| /' conftest.$ac_ext >&5
+
+
+fi
+
+rm -rf conftest.dSYM
+rm -f core conftest.err conftest.$ac_objext conftest_ipa8_conftest.oo \
+ conftest$ac_exeext
+ if test "${ac_cv_search_shm_open+set}" = set; then
+ break
+fi
+done
+if test "${ac_cv_search_shm_open+set}" = set; then
+ :
+else
+ ac_cv_search_shm_open=no
+fi
+rm conftest.$ac_ext
+LIBS=$ac_func_search_save_LIBS
+fi
+{ $as_echo "$as_me:$LINENO: result: $ac_cv_search_shm_open" >&5
+$as_echo "$ac_cv_search_shm_open" >&6; }
+ac_res=$ac_cv_search_shm_open
+if test "$ac_res" != no; then
+ test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
+
+fi
+
+{ $as_echo "$as_me:$LINENO: checking for library containing shm_unlink" >&5
+$as_echo_n "checking for library containing shm_unlink... " >&6; }
+if test "${ac_cv_search_shm_unlink+set}" = set; then
+ $as_echo_n "(cached) " >&6
+else
+ ac_func_search_save_LIBS=$LIBS
+cat >conftest.$ac_ext <<_ACEOF
+/* confdefs.h. */
+_ACEOF
+cat confdefs.h >>conftest.$ac_ext
+cat >>conftest.$ac_ext <<_ACEOF
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char shm_unlink ();
+int
+main ()
+{
+return shm_unlink ();
+ ;
+ return 0;
+}
+_ACEOF
+for ac_lib in '' rt; do
+ if test -z "$ac_lib"; then
+ ac_res="none required"
+ else
+ ac_res=-l$ac_lib
+ LIBS="-l$ac_lib $ac_func_search_save_LIBS"
+ fi
+ rm -f conftest.$ac_objext conftest$ac_exeext
+if { (ac_try="$ac_link"
+case "(($ac_try" in
+ *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
+ *) ac_try_echo=$ac_try;;
+esac
+eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\""
+$as_echo "$ac_try_echo") >&5
+ (eval "$ac_link") 2>conftest.er1
+ ac_status=$?
+ grep -v '^ *+' conftest.er1 >conftest.err
+ rm -f conftest.er1
+ cat conftest.err >&5
+ $as_echo "$as_me:$LINENO: \$? = $ac_status" >&5
+ (exit $ac_status); } && {
+ test -z "$ac_c_werror_flag" ||
+ test ! -s conftest.err
+ } && test -s conftest$ac_exeext && {
+ test "$cross_compiling" = yes ||
+ $as_test_x conftest$ac_exeext
+ }; then
+ ac_cv_search_shm_unlink=$ac_res
+else
+ $as_echo "$as_me: failed program was:" >&5
+sed 's/^/| /' conftest.$ac_ext >&5
+
+
+fi
+
+rm -rf conftest.dSYM
+rm -f core conftest.err conftest.$ac_objext conftest_ipa8_conftest.oo \
+ conftest$ac_exeext
+ if test "${ac_cv_search_shm_unlink+set}" = set; then
+ break
+fi
+done
+if test "${ac_cv_search_shm_unlink+set}" = set; then
+ :
+else
+ ac_cv_search_shm_unlink=no
+fi
+rm conftest.$ac_ext
+LIBS=$ac_func_search_save_LIBS
+fi
+{ $as_echo "$as_me:$LINENO: result: $ac_cv_search_shm_unlink" >&5
+$as_echo "$ac_cv_search_shm_unlink" >&6; }
+ac_res=$ac_cv_search_shm_unlink
+if test "$ac_res" != no; then
+ test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
+
+fi
+
# Solaris:
{ $as_echo "$as_me:$LINENO: checking for library containing fdatasync" >&5
$as_echo_n "checking for library containing fdatasync... " >&6; }
@@ -19763,7 +19937,8 @@ LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
+
+for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
do
as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
{ $as_echo "$as_me:$LINENO: checking for $ac_func" >&5
diff --git a/configure.in b/configure.in
index 82771bd..ead0908 100644
--- a/configure.in
+++ b/configure.in
@@ -883,6 +883,8 @@ case $host_os in
esac
AC_SEARCH_LIBS(getopt_long, [getopt gnugetopt])
AC_SEARCH_LIBS(crypt, crypt)
+AC_SEARCH_LIBS(shm_open, rt)
+AC_SEARCH_LIBS(shm_unlink, rt)
# Solaris:
AC_SEARCH_LIBS(fdatasync, [rt posix4])
# Required for thread_test.c on Solaris 2.5:
@@ -1230,7 +1232,7 @@ PGAC_FUNC_GETTIMEOFDAY_1ARG
LIBS_including_readline="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
+AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
AC_REPLACE_FUNCS(fseeko)
case $host_os in
diff --git a/contrib/dsm_demo/Makefile b/contrib/dsm_demo/Makefile
new file mode 100644
index 0000000..dd9ea92
--- /dev/null
+++ b/contrib/dsm_demo/Makefile
@@ -0,0 +1,17 @@
+# contrib/dsm_demo/Makefile
+
+MODULES = dsm_demo
+
+EXTENSION = dsm_demo
+DATA = dsm_demo--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/dsm_demo
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/dsm_demo/dsm_demo--1.0.sql b/contrib/dsm_demo/dsm_demo--1.0.sql
new file mode 100644
index 0000000..7ad6ab1
--- /dev/null
+++ b/contrib/dsm_demo/dsm_demo--1.0.sql
@@ -0,0 +1,14 @@
+/* contrib/dsm_demo/dsm_demo--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION dsm_demo" to load this file. \quit
+
+CREATE FUNCTION dsm_demo_create(pg_catalog.text)
+RETURNS pg_catalog.int8 STRICT
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION dsm_demo_read(pg_catalog.int8)
+RETURNS pg_catalog.text STRICT
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
diff --git a/contrib/dsm_demo/dsm_demo.c b/contrib/dsm_demo/dsm_demo.c
new file mode 100644
index 0000000..0ebbd68
--- /dev/null
+++ b/contrib/dsm_demo/dsm_demo.c
@@ -0,0 +1,97 @@
+/* -------------------------------------------------------------------------
+ *
+ * dsm_demo.c
+ * Dynamic shared memory demonstration.
+ *
+ * Copyright (C) 2013, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/dsm_demo/dsm_demo.c
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/dsm.h"
+#include "fmgr.h"
+
+PG_MODULE_MAGIC;
+
+void _PG_init(void);
+Datum dsm_demo_create(PG_FUNCTION_ARGS);
+Datum dsm_demo_read(PG_FUNCTION_ARGS);
+
+PG_FUNCTION_INFO_V1(dsm_demo_create);
+PG_FUNCTION_INFO_V1(dsm_demo_read);
+
+#define DSM_DEMO_MAGIC 0x44454D4F
+
+typedef struct
+{
+ uint32 magic;
+ int32 len;
+ char data[FLEXIBLE_ARRAY_MEMBER];
+} dsm_demo_payload;
+
+Datum
+dsm_demo_create(PG_FUNCTION_ARGS)
+{
+ text *txt = PG_GETARG_TEXT_PP(0);
+ int len = VARSIZE_ANY(txt);
+ uint64 seglen;
+ dsm_segment *seg;
+ dsm_demo_payload *payload;
+
+ seglen = offsetof(dsm_demo_payload, data) + len;
+ seg = dsm_create(seglen);
+ dsm_keep_mapping(seg);
+
+ payload = dsm_segment_address(seg);
+ payload->magic = DSM_DEMO_MAGIC;
+ payload->len = len;
+ memcpy(payload->data, txt, len);
+
+ PG_RETURN_INT64(dsm_segment_handle(seg));
+}
+
+Datum
+dsm_demo_read(PG_FUNCTION_ARGS)
+{
+ dsm_handle h = PG_GETARG_INT64(0);
+ dsm_segment *seg;
+ bool needs_detach = false;
+ text *txt = NULL;
+ dsm_demo_payload *payload;
+
+ /*
+	 * We could be called from the same session that called dsm_demo_create(),
+ * so search for an existing mapping. If we don't find one, attach the
+ * segment.
+ */
+ seg = dsm_find_mapping(h);
+ if (seg == NULL)
+ {
+ seg = dsm_attach(h);
+ if (!seg)
+ PG_RETURN_NULL();
+ needs_detach = true;
+ }
+
+ /* Extract data, after checking magic number. */
+ payload = dsm_segment_address(seg);
+ if (payload->magic == DSM_DEMO_MAGIC)
+ {
+ txt = palloc(payload->len);
+ memcpy(txt, payload->data, payload->len);
+ }
+
+ /* Detach, if there was no existing mapping. */
+ if (needs_detach)
+ dsm_detach(seg);
+
+ if (txt == NULL)
+ PG_RETURN_NULL();
+
+ PG_RETURN_TEXT_P(txt);
+}
diff --git a/contrib/dsm_demo/dsm_demo.control b/contrib/dsm_demo/dsm_demo.control
new file mode 100644
index 0000000..4060791
--- /dev/null
+++ b/contrib/dsm_demo/dsm_demo.control
@@ -0,0 +1,5 @@
+# dsm_demo extension
+comment = 'Dynamic shared memory demonstration'
+default_version = '1.0'
+module_pathname = 'dsm_demo'
+relocatable = true
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 20e3c32..b604407 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -29,6 +29,7 @@
#endif
#include "miscadmin.h"
+#include "portability/mem.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
@@ -36,31 +37,6 @@
typedef key_t IpcMemoryKey; /* shared memory key passed to shmget(2) */
typedef int IpcMemoryId; /* shared memory ID returned by shmget(2) */
-#define IPCProtection (0600) /* access/modify by user only */
-
-#ifdef SHM_SHARE_MMU /* use intimate shared memory on Solaris */
-#define PG_SHMAT_FLAGS SHM_SHARE_MMU
-#else
-#define PG_SHMAT_FLAGS 0
-#endif
-
-/* Linux prefers MAP_ANONYMOUS, but the flag is called MAP_ANON on other systems. */
-#ifndef MAP_ANONYMOUS
-#define MAP_ANONYMOUS MAP_ANON
-#endif
-
-/* BSD-derived systems have MAP_HASSEMAPHORE, but it's not present (or needed) on Linux. */
-#ifndef MAP_HASSEMAPHORE
-#define MAP_HASSEMAPHORE 0
-#endif
-
-#define PG_MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
-
-/* Some really old systems don't define MAP_FAILED. */
-#ifndef MAP_FAILED
-#define MAP_FAILED ((void *) -1)
-#endif
-
unsigned long UsedShmemSegID = 0;
void *UsedShmemSegAddr = NULL;
diff --git a/src/backend/storage/ipc/Makefile b/src/backend/storage/ipc/Makefile
index 743f30e..873dd60 100644
--- a/src/backend/storage/ipc/Makefile
+++ b/src/backend/storage/ipc/Makefile
@@ -15,7 +15,7 @@ override CFLAGS+= -fno-inline
endif
endif
-OBJS = ipc.o ipci.o pmsignal.o procarray.o procsignal.o shmem.o shmqueue.o \
- sinval.o sinvaladt.o standby.o
+OBJS = dsm_impl.o dsm.o ipc.o ipci.o pmsignal.o procarray.o procsignal.o \
+ shmem.o shmqueue.o sinval.o sinvaladt.o standby.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
new file mode 100644
index 0000000..fa9e003
--- /dev/null
+++ b/src/backend/storage/ipc/dsm.c
@@ -0,0 +1,976 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsm.c
+ * manage dynamic shared memory segments
+ *
+ * This file provides a set of services to make programming with dynamic
+ * shared memory segments more convenient. Unlike the low-level
+ * facilities provided by dsm_impl.h and dsm_impl.c, mappings and segments
+ * created using this module will be cleaned up automatically. Mappings
+ * will be removed when the resource owner under which they were created
+ * is cleaned up, unless dsm_keep_mapping() is used, in which case they
+ * have session lifespan. Segments will be removed when there are no
+ * remaining mappings, or at postmaster shutdown in any case. After a
+ * hard postmaster crash, remaining segments will be removed, if they
+ * still exist, at the next postmaster startup.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/ipc/dsm.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <fcntl.h>
+#include <string.h>
+#include <unistd.h>
+#ifndef WIN32
+#include <sys/mman.h>
+#endif
+#include <sys/stat.h>
+
+#include "lib/ilist.h"
+#include "miscadmin.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/resowner_private.h"
+
+#define PG_DYNSHMEM_STATE_FILE PG_DYNSHMEM_DIR "/state"
+#define PG_DYNSHMEM_NEW_STATE_FILE PG_DYNSHMEM_DIR "/state.new"
+#define PG_DYNSHMEM_STATE_BUFSIZ 512
+#define PG_DYNSHMEM_CONTROL_MAGIC 0x9a503d32
+
+/*
+ * There's no point in getting too cheap here, because the minimum allocation
+ * is one OS page, which is probably at least 4KB and could easily be as high
+ * as 64KB.  Each slot consumes sizeof(dsm_control_item), currently 8 bytes.
+ */
+#define PG_DYNSHMEM_FIXED_SLOTS 64
+#define PG_DYNSHMEM_SLOTS_PER_BACKEND 2
+
+#define INVALID_CONTROL_SLOT ((uint32) -1)
+
+/* Backend-local state for a dynamic shared memory segment. */
+struct dsm_segment
+{
+ dlist_node node; /* List link in dsm_segment_list. */
+ ResourceOwner resowner; /* Resource owner. */
+ dsm_handle handle; /* Segment name. */
+ uint32 control_slot; /* Slot in control segment. */
+ void *impl_private; /* Implementation-specific private data. */
+ void *mapped_address; /* Mapping address, or NULL if unmapped. */
+ uint64 mapped_size; /* Size of our mapping. */
+};
+
+/* Shared-memory state for a dynamic shared memory segment. */
+typedef struct dsm_control_item
+{
+ dsm_handle handle;
+ uint32 refcnt; /* 2+ = active, 1 = moribund, 0 = gone */
+} dsm_control_item;
+
+/* Layout of the dynamic shared memory control segment. */
+typedef struct dsm_control_header
+{
+ uint32 magic;
+ uint32 nitems;
+ uint32 maxitems;
+ dsm_control_item item[FLEXIBLE_ARRAY_MEMBER];
+} dsm_control_header;
+
+static void dsm_cleanup_using_control_segment(void);
+static void dsm_cleanup_for_mmap(void);
+static bool dsm_read_state_file(dsm_handle *h);
+static void dsm_write_state_file(dsm_handle h);
+static void dsm_postmaster_shutdown(int code, Datum arg);
+static void dsm_backend_shutdown(int code, Datum arg);
+static dsm_segment *dsm_create_descriptor(void);
+static bool dsm_control_segment_sane(dsm_control_header *control,
+ uint64 mapped_size);
+static uint64 dsm_control_bytes_needed(uint32 nitems);
+
+/* Has this backend initialized the dynamic shared memory system yet? */
+static bool dsm_init_done = false;
+
+/*
+ * List of dynamic shared memory segments used by this backend.
+ *
+ * At process exit time, we must decrement the reference count of each
+ * segment we have attached; this list makes it possible to find all such
+ * segments.
+ *
+ * This list should always be empty in the postmaster. We could probably
+ * allow the postmaster to map dynamic shared memory segments before it
+ * begins to start child processes, provided that each process adjusted
+ * the reference counts for those segments in the control segment at
+ * startup time, but there's no obvious need for such a facility, which
+ * would also be complex to handle in the EXEC_BACKEND case. Once the
+ * postmaster has begun spawning children, there's an additional problem:
+ * each new mapping would require an update to the control segment,
+ * which requires locking, in which the postmaster must not be involved.
+ */
+static dlist_head dsm_segment_list = DLIST_STATIC_INIT(dsm_segment_list);
+
+/*
+ * Control segment information.
+ *
+ * Unlike ordinary shared memory segments, the control segment is not
+ * reference counted; instead, it lasts for the postmaster's entire
+ * life cycle. For simplicity, it doesn't have a dsm_segment object either.
+ */
+static dsm_handle dsm_control_handle;
+static dsm_control_header *dsm_control;
+static uint64 dsm_control_mapped_size = 0;
+static void *dsm_control_impl_private = NULL;
+
+/*
+ * Start up the dynamic shared memory system.
+ *
+ * This is called just once during each cluster lifetime, at postmaster
+ * startup time.
+ */
+void
+dsm_postmaster_startup(void)
+{
+ void *dsm_control_address = NULL;
+ uint32 maxitems;
+ uint64 segsize;
+
+ Assert(!IsUnderPostmaster);
+
+ /* If dynamic shared memory is disabled, there's nothing to do. */
+ if (dynamic_shared_memory_type == DSM_IMPL_NONE)
+ return;
+
+ /*
+ * Check for, and remove, shared memory segments left behind by a dead
+ * postmaster. This isn't necessary on Windows, which always removes them
+ * when the last reference is gone.
+ */
+ switch (dynamic_shared_memory_type)
+ {
+ case DSM_IMPL_POSIX:
+ case DSM_IMPL_SYSV:
+ dsm_cleanup_using_control_segment();
+ break;
+ case DSM_IMPL_MMAP:
+ dsm_cleanup_for_mmap();
+ break;
+ case DSM_IMPL_WINDOWS:
+ /* Nothing to do. */
+ break;
+ default:
+ elog(ERROR, "unknown dynamic shared memory type: %d",
+ dynamic_shared_memory_type);
+ }
+
+ /* Determine size for new control segment. */
+ maxitems = PG_DYNSHMEM_FIXED_SLOTS
+ + PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;
+ elog(DEBUG2, "dynamic shared memory system will support %lu segments",
+ (unsigned long) maxitems);
+ segsize = dsm_control_bytes_needed(maxitems);
+
+ /* Loop until we find an unused identifier for the new control segment. */
+ for (;;)
+ {
+ Assert(dsm_control_address == NULL);
+ Assert(dsm_control_mapped_size == 0);
+ dsm_control_handle = random();
+ if (dsm_impl_op(DSM_OP_CREATE, dsm_control_handle, segsize,
+ &dsm_control_impl_private, &dsm_control_address,
+ &dsm_control_mapped_size, ERROR))
+ break;
+ }
+ dsm_control = dsm_control_address;
+ on_shmem_exit(dsm_postmaster_shutdown, 0);
+ elog(DEBUG2, "created dynamic shared memory control segment %lu ("
+ UINT64_FORMAT " bytes)", (unsigned long) dsm_control_handle,
+ segsize);
+ dsm_write_state_file(dsm_control_handle);
+
+ /* Initialize control segment. */
+ dsm_control->magic = PG_DYNSHMEM_CONTROL_MAGIC;
+ dsm_control->nitems = 0;
+ dsm_control->maxitems = maxitems;
+}
+
+/*
+ * Determine whether the control segment from the previous postmaster
+ * invocation still exists. If so, remove the dynamic shared memory
+ * segments to which it refers, and then the control segment itself.
+ */
+static void
+dsm_cleanup_using_control_segment(void)
+{
+ void *mapped_address = NULL;
+ void *junk_mapped_address = NULL;
+ void *impl_private = NULL;
+ void *junk_impl_private = NULL;
+ uint64 mapped_size = 0;
+ uint64 junk_mapped_size = 0;
+ uint32 nitems;
+ uint32 i;
+ dsm_handle old_control_handle;
+ dsm_control_header *old_control;
+
+ /*
+ * Read the state file. If it doesn't exist or is empty, there's nothing
+ * more to do.
+ */
+ if (!dsm_read_state_file(&old_control_handle))
+ return;
+
+ /*
+ * Try to attach the segment. If this fails, it probably just means that
+ * the operating system has been rebooted and the segment no longer exists,
+	 * or an unrelated process has reused the same shm ID.  So just fall out
+ * quietly.
+ */
+ if (!dsm_impl_op(DSM_OP_ATTACH, old_control_handle, 0, &impl_private,
+ &mapped_address, &mapped_size, DEBUG1))
+ return;
+
+ /*
+ * We've managed to reattach it, but the contents might not be sane.
+ * If they aren't, we disregard the segment after all.
+ */
+ old_control = (dsm_control_header *) mapped_address;
+ if (!dsm_control_segment_sane(old_control, mapped_size))
+ {
+ dsm_impl_op(DSM_OP_DETACH, old_control_handle, 0, &impl_private,
+ &mapped_address, &mapped_size, LOG);
+ return;
+ }
+
+ /*
+	 * OK, the control segment looks basically valid, so we can use it
+	 * to get a list of segments that need to be removed.
+ */
+ nitems = old_control->nitems;
+ for (i = 0; i < nitems; ++i)
+ {
+ dsm_handle handle;
+ uint32 refcnt;
+
+ /* If the reference count is 0, the slot is actually unused. */
+ refcnt = old_control->item[i].refcnt;
+ if (refcnt == 0)
+ continue;
+
+ /* Log debugging information. */
+ handle = old_control->item[i].handle;
+ elog(DEBUG2, "cleaning up orphaned dynamic shared memory with ID %lu (reference count %lu)",
+ (unsigned long) handle, (unsigned long) refcnt);
+
+ /* Destroy the referenced segment. */
+ dsm_impl_op(DSM_OP_DESTROY, handle, 0, &junk_impl_private,
+ &junk_mapped_address, &junk_mapped_size, LOG);
+ }
+
+ /* Destroy the old control segment, too. */
+ elog(DEBUG2,
+ "cleaning up dynamic shared memory control segment with ID %lu",
+ (unsigned long) old_control_handle);
+ dsm_impl_op(DSM_OP_DESTROY, old_control_handle, 0, &impl_private,
+ &mapped_address, &mapped_size, LOG);
+}
+
+/*
+ * When we're using the mmap shared memory implementation, "shared memory"
+ * segments might even manage to survive an operating system reboot.
+ * But there's no guarantee as to exactly what will survive: some segments
+ * may survive, and others may not, and the contents of some may be out
+ * of date. In particular, the control segment may be out of date, so we
+ * can't rely on it to figure out what to remove. However, since we know
+ * what directory contains the files we used as shared memory, we can simply
+ * scan the directory and blow everything away that shouldn't be there.
+ */
+static void
+dsm_cleanup_for_mmap(void)
+{
+ DIR *dir;
+ struct dirent *dent;
+
+ /* Open the directory; can't use AllocateDir in postmaster. */
+ if ((dir = opendir(PG_DYNSHMEM_DIR)) == NULL)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open directory \"%s\": %m",
+ PG_DYNSHMEM_DIR)));
+
+ /* Scan for something with a name of the correct format. */
+ while ((dent = readdir(dir)) != NULL)
+ {
+ if (strncmp(dent->d_name, PG_DYNSHMEM_MMAP_FILE_PREFIX,
+ strlen(PG_DYNSHMEM_MMAP_FILE_PREFIX)) == 0)
+ {
+ char buf[MAXPGPATH];
+ snprintf(buf, MAXPGPATH, PG_DYNSHMEM_DIR "/%s", dent->d_name);
+
+ elog(DEBUG2, "removing file \"%s\"", buf);
+
+ /* We found a matching file; so remove it. */
+ if (unlink(buf) != 0)
+ {
+ int save_errno;
+
+ save_errno = errno;
+ closedir(dir);
+ errno = save_errno;
+
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", buf)));
+ }
+ }
+ }
+
+ /* Cleanup complete. */
+ closedir(dir);
+}
+
+/*
+ * Read and parse the state file.
+ *
+ * If the state file is empty or the contents are garbled, it probably means
+ * that the operating system rebooted before the data written by the previous
+ * postmaster made it to disk. In that case, we can just ignore it; any shared
+ * memory from before the reboot should be gone anyway.
+ */
+static bool
+dsm_read_state_file(dsm_handle *h)
+{
+ int statefd;
+ char statebuf[PG_DYNSHMEM_STATE_BUFSIZ];
+ int nbytes = 0;
+ char *endptr,
+ *s;
+ dsm_handle handle;
+
+ /* Read the state file to get the ID of the old control segment. */
+ statefd = open(PG_DYNSHMEM_STATE_FILE, O_RDONLY, 0);
+ if (statefd < 0)
+ {
+ if (errno == ENOENT)
+ return false;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m",
+ PG_DYNSHMEM_STATE_FILE)));
+ }
+ nbytes = read(statefd, statebuf, PG_DYNSHMEM_STATE_BUFSIZ - 1);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ PG_DYNSHMEM_STATE_FILE)));
+ /* make sure buffer is NUL terminated */
+ statebuf[nbytes] = '\0';
+ close(statefd);
+
+ /*
+ * We expect to find the handle of the old control segment here,
+ * on a line by itself.
+ */
+ handle = strtoul(statebuf, &endptr, 10);
+ for (s = endptr; *s == ' ' || *s == '\t'; ++s)
+ ;
+ if (*s != '\n' && *s != '\0')
+ return false;
+
+ /* Looks good. */
+ *h = handle;
+ return true;
+}
+
+/*
+ * Write our control segment handle to the state file, so that if the
+ * postmaster is killed without running its on_shmem_exit hooks, the
+ * next postmaster can clean things up after restart.
+ */
+static void
+dsm_write_state_file(dsm_handle h)
+{
+ int statefd;
+ char statebuf[PG_DYNSHMEM_STATE_BUFSIZ];
+ int nbytes;
+
+ /* Create or truncate the file. */
+ statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);
+ if (statefd < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m",
+ PG_DYNSHMEM_NEW_STATE_FILE)));
+
+ /* Write contents. */
+ snprintf(statebuf, PG_DYNSHMEM_STATE_BUFSIZ, "%lu\n",
+ (unsigned long) dsm_control_handle);
+ nbytes = strlen(statebuf);
+ if (write(statefd, statebuf, nbytes) != nbytes)
+ {
+ if (errno == 0)
+ errno = ENOSPC; /* if no error signalled, assume no space */
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ PG_DYNSHMEM_NEW_STATE_FILE)));
+ }
+
+ /* Close file. */
+ close(statefd);
+
+ /*
+ * Atomically rename file into place, so that no one ever sees a partially
+ * written state file.
+ */
+ if (rename(PG_DYNSHMEM_NEW_STATE_FILE, PG_DYNSHMEM_STATE_FILE) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not rename file \"%s\": %m",
+ PG_DYNSHMEM_NEW_STATE_FILE)));
+}
+
+/*
+ * At shutdown time, we iterate over the control segment and remove all
+ * remaining dynamic shared memory segments. We avoid throwing errors here;
+ * the postmaster is shutting down either way, and this is just non-critical
+ * resource cleanup.
+ */
+static void
+dsm_postmaster_shutdown(int code, Datum arg)
+{
+ uint32 nitems;
+ uint32 i;
+ void *dsm_control_address;
+ void *junk_mapped_address = NULL;
+ void *junk_impl_private = NULL;
+ uint64 junk_mapped_size = 0;
+
+ /*
+ * If some other backend exited uncleanly, it might have corrupted the
+ * control segment while it was dying. In that case, we warn and ignore
+ * the contents of the control segment. This may end up leaving behind
+ * stray shared memory segments, but there's not much we can do about
+ * that if the metadata is gone.
+ */
+ nitems = dsm_control->nitems;
+ if (!dsm_control_segment_sane(dsm_control, dsm_control_mapped_size))
+ {
+ ereport(LOG,
+ (errmsg("dynamic shared memory control segment is corrupt")));
+ return;
+ }
+
+ /* Remove any remaining segments. */
+ for (i = 0; i < nitems; ++i)
+ {
+ dsm_handle handle;
+
+ /* If the reference count is 0, the slot is actually unused. */
+ if (dsm_control->item[i].refcnt == 0)
+ continue;
+
+ /* Log debugging information. */
+ handle = dsm_control->item[i].handle;
+ elog(DEBUG2, "cleaning up orphaned dynamic shared memory with ID %lu",
+ (unsigned long) handle);
+
+ /* Destroy the segment. */
+ dsm_impl_op(DSM_OP_DESTROY, handle, 0, &junk_impl_private,
+ &junk_mapped_address, &junk_mapped_size, LOG);
+ }
+
+ /* Remove the control segment itself. */
+ elog(DEBUG2,
+ "cleaning up dynamic shared memory control segment with ID %lu",
+ (unsigned long) dsm_control_handle);
+ dsm_control_address = dsm_control;
+ dsm_impl_op(DSM_OP_DESTROY, dsm_control_handle, 0,
+ &dsm_control_impl_private, &dsm_control_address,
+ &dsm_control_mapped_size, LOG);
+ dsm_control = dsm_control_address;
+
+ /* And, finally, remove the state file. */
+ if (unlink(PG_DYNSHMEM_STATE_FILE) < 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not unlink file \"%s\": %m",
+ PG_DYNSHMEM_STATE_FILE)));
+}
+
+/*
+ * Prepare this backend for dynamic shared memory usage. Under EXEC_BACKEND,
+ * we must reread the state file and map the control segment; in other cases,
+ * we'll have inherited the postmaster's mapping and global variables.
+ */
+static void
+dsm_backend_startup(void)
+{
+ /* If dynamic shared memory is disabled, reject this. */
+ if (dynamic_shared_memory_type == DSM_IMPL_NONE)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("dynamic shared memory is disabled"),
+				 errhint("Set dynamic_shared_memory_type to a value other than \"none\".")));
+
+#ifdef EXEC_BACKEND
+ {
+ dsm_handle control_handle;
+ void *control_address = NULL;
+
+ /* Read the control segment information from the state file. */
+ if (!dsm_read_state_file(&control_handle))
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("could not parse dynamic shared memory state file")));
+
+ /* Attach control segment. */
+ dsm_impl_op(DSM_OP_ATTACH, control_handle, 0,
+ &dsm_control_impl_private, &control_address,
+ &dsm_control_mapped_size, ERROR);
+ dsm_control_handle = control_handle;
+ dsm_control = control_address;
+ /* If control segment doesn't look sane, something is badly wrong. */
+ if (!dsm_control_segment_sane(dsm_control, dsm_control_mapped_size))
+ {
+ dsm_impl_op(DSM_OP_DETACH, control_handle, 0,
+ &dsm_control_impl_private, &control_address,
+ &dsm_control_mapped_size, WARNING);
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("dynamic shared memory control segment is not valid")));
+ }
+ }
+#endif
+
+ /* Arrange to detach segments on exit. */
+ on_shmem_exit(dsm_backend_shutdown, 0);
+
+ dsm_init_done = true;
+}
+
+/*
+ * Create a new dynamic shared memory segment.
+ */
+dsm_segment *
+dsm_create(uint64 size)
+{
+ dsm_segment *seg = dsm_create_descriptor();
+ uint32 i;
+ uint32 nitems;
+
+ /* Unsafe in postmaster (and pointless in a stand-alone backend). */
+ Assert(IsUnderPostmaster);
+
+ if (!dsm_init_done)
+ dsm_backend_startup();
+
+ /* Loop until we find an unused segment identifier. */
+ for (;;)
+ {
+ Assert(seg->mapped_address == NULL && seg->mapped_size == 0);
+ seg->handle = random();
+ if (dsm_impl_op(DSM_OP_CREATE, seg->handle, size, &seg->impl_private,
+ &seg->mapped_address, &seg->mapped_size, ERROR))
+ break;
+ }
+
+ /* Lock the control segment so we can register the new segment. */
+ LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
+
+ /* Search the control segment for an unused slot. */
+ nitems = dsm_control->nitems;
+ for (i = 0; i < nitems; ++i)
+ {
+ if (dsm_control->item[i].refcnt == 0)
+ {
+ dsm_control->item[i].handle = seg->handle;
+ /* refcnt of 1 triggers destruction, so start at 2 */
+ dsm_control->item[i].refcnt = 2;
+ seg->control_slot = i;
+ LWLockRelease(DynamicSharedMemoryControlLock);
+ return seg;
+ }
+ }
+
+ /* Verify that we can support an additional mapping. */
+ if (nitems >= dsm_control->maxitems)
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+ errmsg("too many dynamic shared memory segments")));
+
+ /* Enter the handle into a new array slot. */
+ dsm_control->item[nitems].handle = seg->handle;
+ /* refcnt of 1 triggers destruction, so start at 2 */
+ dsm_control->item[nitems].refcnt = 2;
+ seg->control_slot = nitems;
+ dsm_control->nitems++;
+ LWLockRelease(DynamicSharedMemoryControlLock);
+
+ return seg;
+}
+
+/*
+ * Attach a dynamic shared memory segment.
+ *
+ * See comments for dsm_segment_handle() for an explanation of how this
+ * is intended to be used.
+ *
+ * This function will return NULL if the segment isn't known to the system.
+ * This can happen if we're asked to attach the segment, but then everyone
+ * else detaches it (causing it to be destroyed) before we get around to
+ * attaching it.
+ */
+dsm_segment *
+dsm_attach(dsm_handle h)
+{
+ dsm_segment *seg;
+ dlist_iter iter;
+ uint32 i;
+ uint32 nitems;
+
+ /* Unsafe in postmaster (and pointless in a stand-alone backend). */
+ Assert(IsUnderPostmaster);
+
+ if (!dsm_init_done)
+ dsm_backend_startup();
+
+ /*
+ * Since this is just a debugging cross-check, we could leave it out
+ * altogether, or include it only in assert-enabled builds. But since
+ * the list of attached segments should normally be very short, let's
+	 * always include it for now.
+ *
+ * If you're hitting this error, you probably want to attempt to
+ * find an existing mapping via dsm_find_mapping() before calling
+ * dsm_attach() to create a new one.
+ */
+ dlist_foreach(iter, &dsm_segment_list)
+ {
+ seg = dlist_container(dsm_segment, node, iter.cur);
+ if (seg->handle == h)
+ elog(ERROR, "can't attach the same segment more than once");
+ }
+
+ /* Create a new segment descriptor. */
+ seg = dsm_create_descriptor();
+ seg->handle = h;
+
+ /* Bump reference count for this segment in shared memory. */
+ LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
+ nitems = dsm_control->nitems;
+ for (i = 0; i < nitems; ++i)
+ {
+ /* If the reference count is 0, the slot is actually unused. */
+ if (dsm_control->item[i].refcnt == 0)
+ continue;
+
+ /*
+ * If the reference count is 1, the slot is still in use, but the
+ * segment is in the process of going away. Treat that as if we
+ * didn't find a match.
+ */
+ if (dsm_control->item[i].refcnt == 1)
+ break;
+
+ /* Otherwise, if the descriptor matches, we've found a match. */
+ if (dsm_control->item[i].handle == seg->handle)
+ {
+ dsm_control->item[i].refcnt++;
+ seg->control_slot = i;
+ break;
+ }
+ }
+ LWLockRelease(DynamicSharedMemoryControlLock);
+
+ /*
+ * If we didn't find the handle we're looking for in the control
+ * segment, it probably means that everyone else who had it mapped,
+ * including the original creator, died before we got to this point.
+ * It's up to the caller to decide what to do about that.
+ */
+ if (seg->control_slot == INVALID_CONTROL_SLOT)
+ {
+ dsm_detach(seg);
+ return NULL;
+ }
+
+ /* Here's where we actually try to map the segment. */
+ dsm_impl_op(DSM_OP_ATTACH, seg->handle, 0, &seg->impl_private,
+ &seg->mapped_address, &seg->mapped_size, ERROR);
+
+ return seg;
+}
+
+/*
+ * At backend shutdown time, detach any segments that are still attached.
+ */
+static void
+dsm_backend_shutdown(int code, Datum arg)
+{
+ while (!dlist_is_empty(&dsm_segment_list))
+ {
+ dsm_segment *seg;
+
+ seg = dlist_head_element(dsm_segment, node, &dsm_segment_list);
+ dsm_detach(seg);
+ }
+}
+
+/*
+ * Resize an existing shared memory segment.
+ *
+ * This may cause the shared memory segment to be remapped at a different
+ * address. For the caller's convenience, we return the mapped address.
+ */
+void *
+dsm_resize(dsm_segment *seg, uint64 size)
+{
+ Assert(seg->control_slot != INVALID_CONTROL_SLOT);
+ dsm_impl_op(DSM_OP_RESIZE, seg->handle, size, &seg->impl_private,
+ &seg->mapped_address, &seg->mapped_size, ERROR);
+ return seg->mapped_address;
+}
+
+/*
+ * Remap an existing shared memory segment.
+ *
+ * This is intended to be used when some other process has extended the
+ * mapping using dsm_resize(), but we've still only got the initial
+ * portion mapped. Since this might change the address at which the
+ * segment is mapped, we return the new mapped address.
+ */
+void *
+dsm_remap(dsm_segment *seg)
+{
+ if (!dsm_impl_can_resize())
+ return seg->mapped_address;
+
+ dsm_impl_op(DSM_OP_ATTACH, seg->handle, 0, &seg->impl_private,
+ &seg->mapped_address, &seg->mapped_size, ERROR);
+
+ return seg->mapped_address;
+}
+
+/*
+ * Detach from a shared memory segment, destroying the segment if we
+ * remove the last reference.
+ *
+ * This function should never fail. It will often be invoked when aborting
+ * a transaction, and a further error won't serve any purpose. It's not a
+ * complete disaster if we fail to unmap or destroy the segment; it means a
+ * resource leak, but that doesn't necessarily preclude further operations.
+ */
+void
+dsm_detach(dsm_segment *seg)
+{
+ /*
+ * Try to remove the mapping, if one exists. Normally, there will be,
+ * but maybe not, if we failed partway through a create or attach
+ * operation. We remove the mapping before decrementing the reference
+ * count so that the process that sees a zero reference count can be
+ * certain that no remaining mappings exist. Even if this fails, we
+ * pretend that it works, because retrying is likely to fail in the
+ * same way.
+ */
+ if (seg->mapped_address != NULL)
+ {
+ dsm_impl_op(DSM_OP_DETACH, seg->handle, 0, &seg->impl_private,
+ &seg->mapped_address, &seg->mapped_size, WARNING);
+ seg->impl_private = NULL;
+ seg->mapped_address = NULL;
+ seg->mapped_size = 0;
+ }
+
+ /* Reduce reference count, if we previously increased it. */
+ if (seg->control_slot != INVALID_CONTROL_SLOT)
+ {
+ uint32 refcnt;
+ uint32 control_slot = seg->control_slot;
+
+ LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
+ Assert(dsm_control->item[control_slot].handle == seg->handle);
+ Assert(dsm_control->item[control_slot].refcnt > 1);
+ refcnt = --dsm_control->item[control_slot].refcnt;
+ seg->control_slot = INVALID_CONTROL_SLOT;
+ LWLockRelease(DynamicSharedMemoryControlLock);
+
+ /* If new reference count is 1, try to destroy the segment. */
+ if (refcnt == 1)
+ {
+ /*
+ * If we fail to destroy the segment here, or are killed before
+ * we finish doing so, the reference count will remain at 1, which
+ * will mean that nobody else can attach to the segment. At
+ * postmaster shutdown time, or when a new postmaster is started
+ * after a hard kill, another attempt will be made to remove the
+ * segment.
+ *
+ * The main case we're worried about here is being killed by
+ * a signal before we can finish removing the segment. In that
+ * case, it's important to be sure that the segment still gets
+ * removed. If we actually fail to remove the segment for some
+ * other reason, the postmaster may not have any better luck than
+ * we did. There's not much we can do about that, though.
+ */
+ if (dsm_impl_op(DSM_OP_DESTROY, seg->handle, 0, &seg->impl_private,
+ &seg->mapped_address, &seg->mapped_size, WARNING))
+ {
+ LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
+ Assert(dsm_control->item[control_slot].handle == seg->handle);
+ Assert(dsm_control->item[control_slot].refcnt == 1);
+ dsm_control->item[control_slot].refcnt = 0;
+ LWLockRelease(DynamicSharedMemoryControlLock);
+ }
+ }
+ }
+
+ /* Clean up our remaining backend-private data structures. */
+ if (seg->resowner != NULL)
+ ResourceOwnerForgetDSM(seg->resowner, seg);
+ dlist_delete(&seg->node);
+ pfree(seg);
+}
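The reference-counting protocol implemented by dsm_attach() and dsm_detach() above (0 = slot unused, 1 = destroy pending, n = n-1 real attachments plus one bookkeeping reference) can be simulated in miniature. The following standalone sketch uses hypothetical names (`slot`, `slot_create`, etc.) and elides the locking that the real code performs under DynamicSharedMemoryControlLock:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for a dsm_control_item slot: refcnt semantics only. */
typedef struct { uint32_t refcnt; } slot;

/* Creating a segment takes one bookkeeping reference plus the creator's. */
static void slot_create(slot *s) { s->refcnt = 2; }

/* Attach fails if the slot is unused (0) or being destroyed (1). */
static int slot_attach(slot *s)
{
	if (s->refcnt <= 1)
		return 0;
	s->refcnt++;
	return 1;
}

/* Detach drops one reference; the last real user destroys the segment. */
static void slot_detach(slot *s)
{
	if (--s->refcnt == 1)
		s->refcnt = 0;	/* destroy succeeded: slot is reusable */
}
```

A typical lifetime is create (refcnt 2), a second backend attaches (3), the creator detaches (2), and the last user detaches, which destroys the segment and frees the slot (0).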
+
+/*
+ * Keep a dynamic shared memory mapping until end of session.
+ *
+ * By default, mappings are owned by the current resource owner, which
+ * typically means they stick around for the duration of the current query
+ * only.
+ */
+void
+dsm_keep_mapping(dsm_segment *seg)
+{
+ if (seg->resowner != NULL)
+ {
+ ResourceOwnerForgetDSM(seg->resowner, seg);
+ seg->resowner = NULL;
+ }
+}
+
+/*
+ * Find an existing mapping for a shared memory segment, if there is one.
+ */
+dsm_segment *
+dsm_find_mapping(dsm_handle h)
+{
+ dlist_iter iter;
+ dsm_segment *seg;
+
+ dlist_foreach(iter, &dsm_segment_list)
+ {
+ seg = dlist_container(dsm_segment, node, iter.cur);
+ if (seg->handle == h)
+ return seg;
+ }
+
+ return NULL;
+}
+
+/*
+ * Get the address at which a dynamic shared memory segment is mapped.
+ */
+void *
+dsm_segment_address(dsm_segment *seg)
+{
+ Assert(seg->mapped_address != NULL);
+ return seg->mapped_address;
+}
+
+/*
+ * Get the size of a mapping.
+ */
+uint64
+dsm_segment_map_length(dsm_segment *seg)
+{
+ Assert(seg->mapped_address != NULL);
+ return seg->mapped_size;
+}
+
+/*
+ * Get a handle for a mapping.
+ *
+ * To establish communication via dynamic shared memory between two backends,
+ * one of them should first call dsm_create() to establish a new shared
+ * memory mapping. That process should then call dsm_segment_handle() to
+ * obtain a handle for the mapping, and pass that handle to the
+ * coordinating backend via some means (e.g. bgw_main_arg, or via the
+ * main shared memory segment). The recipient, once in possession of the
+ * handle, should call dsm_attach().
+ */
+dsm_handle
+dsm_segment_handle(dsm_segment *seg)
+{
+ return seg->handle;
+}
+
+/*
+ * Create a segment descriptor.
+ */
+static dsm_segment *
+dsm_create_descriptor(void)
+{
+ dsm_segment *seg;
+
+ ResourceOwnerEnlargeDSMs(CurrentResourceOwner);
+
+ seg = MemoryContextAlloc(TopMemoryContext, sizeof(dsm_segment));
+ dlist_push_head(&dsm_segment_list, &seg->node);
+
+ /* seg->handle must be initialized by the caller */
+ seg->control_slot = INVALID_CONTROL_SLOT;
+ seg->impl_private = NULL;
+ seg->mapped_address = NULL;
+ seg->mapped_size = 0;
+
+ seg->resowner = CurrentResourceOwner;
+ ResourceOwnerRememberDSM(CurrentResourceOwner, seg);
+
+ return seg;
+}
+
+/*
+ * Sanity check a control segment.
+ *
+ * The goal here isn't to detect everything that could possibly be wrong with
+ * the control segment; there's not enough information for that. Rather, the
+ * goal is to make sure that someone can iterate over the items in the segment
+ * without overrunning the end of the mapping and crashing. We also check
+ * the magic number since, if that's messed up, this may not even be one of
+ * our segments at all.
+ */
+static bool
+dsm_control_segment_sane(dsm_control_header *control, uint64 mapped_size)
+{
+ if (mapped_size < offsetof(dsm_control_header, item))
+ return false; /* Mapped size too short to read header. */
+ if (control->magic != PG_DYNSHMEM_CONTROL_MAGIC)
+ return false; /* Magic number doesn't match. */
+ if (dsm_control_bytes_needed(control->maxitems) > mapped_size)
+ return false; /* Max item count won't fit in map. */
+ if (control->nitems > control->maxitems)
+ return false; /* Overfull. */
+ return true;
+}
+
+/*
+ * Compute the number of control-segment bytes needed to store a given
+ * number of items.
+ */
+static uint64
+dsm_control_bytes_needed(uint32 nitems)
+{
+ return offsetof(dsm_control_header, item)
+ + sizeof(dsm_control_item) * (uint64) nitems;
+}
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
new file mode 100644
index 0000000..b00e63a
--- /dev/null
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -0,0 +1,990 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsm_impl.c
+ * manage dynamic shared memory segments
+ *
+ * This file provides low-level APIs for creating and destroying shared
+ * memory segments using several different possible techniques. We refer
+ * to these segments as dynamic because they can be created, altered, and
+ * destroyed at any point during the server life cycle. This is unlike
+ * the main shared memory segment, of which there is always exactly one
+ * and which is always mapped at a fixed address in every PostgreSQL
+ * background process.
+ *
+ * Because not all systems provide the same primitives in this area, nor
+ * do all primitives behave the same way on all systems, we provide
+ * several implementations of this facility. Many systems implement
+ * POSIX shared memory (shm_open etc.), which is well-suited to our needs
+ * in this area, with the exception that shared memory identifiers live
+ * in a flat system-wide namespace, raising the uncomfortable prospect of
+ * name collisions with other processes (including other copies of
+ * PostgreSQL) running on the same system. Some systems only support
+ * the older System V shared memory interface (shmget etc.) which is
+ * also usable; however, the default allocation limits are often quite
+ * small, and the namespace is even more restricted.
+ *
+ * We also provide an mmap-based shared memory implementation. This may
+ * be useful on systems that provide shared memory via a special-purpose
+ * filesystem; by opting for this implementation, the user can even
+ * control precisely where their shared memory segments are placed. It
+ * can also be used as a fallback for systems where shm_open and shmget
+ * are not available or can't be used for some reason. Of course,
+ * mapping a file residing on an actual spinning disk is a fairly poor
+ * approximation for shared memory because writeback may hurt performance
+ * substantially, but there should be few systems where we must make do
+ * with such poor tools.
+ *
+ * As ever, Windows requires its own implementation.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/ipc/dsm_impl.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <fcntl.h>
+#include <string.h>
+#include <unistd.h>
+#ifndef WIN32
+#include <sys/mman.h>
+#endif
+#include <sys/stat.h>
+#ifdef HAVE_SYS_IPC_H
+#include <sys/ipc.h>
+#endif
+#ifdef HAVE_SYS_SHM_H
+#include <sys/shm.h>
+#endif
+
+#include "portability/mem.h"
+#include "storage/dsm_impl.h"
+#include "storage/fd.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+
+#ifdef USE_DSM_POSIX
+static bool dsm_impl_posix(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address,
+ uint64 *mapped_size, int elevel);
+#endif
+#ifdef USE_DSM_SYSV
+static bool dsm_impl_sysv(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address,
+ uint64 *mapped_size, int elevel);
+#endif
+#ifdef USE_DSM_WINDOWS
+static bool dsm_impl_windows(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address,
+ uint64 *mapped_size, int elevel);
+#endif
+#ifdef USE_DSM_MMAP
+static bool dsm_impl_mmap(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address,
+ uint64 *mapped_size, int elevel);
+#endif
+static int errcode_for_dynamic_shared_memory(void);
+
+const struct config_enum_entry dynamic_shared_memory_options[] = {
+#ifdef USE_DSM_POSIX
+ { "posix", DSM_IMPL_POSIX, false},
+#endif
+#ifdef USE_DSM_SYSV
+ { "sysv", DSM_IMPL_SYSV, false},
+#endif
+#ifdef USE_DSM_WINDOWS
+ { "windows", DSM_IMPL_WINDOWS, false},
+#endif
+#ifdef USE_DSM_MMAP
+ { "mmap", DSM_IMPL_MMAP, false},
+#endif
+ { "none", DSM_IMPL_NONE, false},
+ {NULL, 0, false}
+};
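The `dynamic_shared_memory_options` table above is a NULL-terminated array consumed by the GUC enum machinery. The scan it performs can be sketched with mock names (`enum_entry`, `lookup_option`, and arbitrary option values are assumptions here, not the patch's API):

```c
#include <assert.h>
#include <string.h>

/* Minimal mirror of config_enum_entry: option name -> implementation code. */
typedef struct { const char *name; int val; } enum_entry;

static const enum_entry mock_options[] = {
	{"posix", 1},
	{"sysv", 2},
	{"mmap", 4},
	{"none", 0},
	{NULL, 0}		/* terminator, as in the real table */
};

/* Scan the NULL-terminated table; return -1 when the name is absent. */
static int lookup_option(const enum_entry *tab, const char *name)
{
	for (; tab->name != NULL; tab++)
		if (strcmp(tab->name, name) == 0)
			return tab->val;
	return -1;
}
```

Because entries are conditionally compiled in, an implementation that a given build doesn't support simply never appears in the table, and the lookup fails cleanly.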
+
+/* Implementation selector. */
+int dynamic_shared_memory_type;
+
+/* Size of buffer to be used for zero-filling. */
+#define ZBUFFER_SIZE 8192
+
+/*------
+ * Perform a low-level shared memory operation in a platform-specific way,
+ * as dictated by the selected implementation. Each implementation is
+ * required to implement the following primitives.
+ *
+ * DSM_OP_CREATE. Create a segment whose size is the request_size and
+ * map it.
+ *
+ * DSM_OP_ATTACH. Map the segment, whose size must be the request_size.
+ * The segment may already be mapped; any existing mapping should be removed
+ * before creating a new one.
+ *
+ * DSM_OP_DETACH. Unmap the segment.
+ *
+ * DSM_OP_RESIZE. Resize the segment to the given request_size and
+ * remap the segment at that new size.
+ *
+ * DSM_OP_DESTROY. Unmap the segment, if it is mapped. Destroy the
+ * segment.
+ *
+ * Arguments:
+ * op: The operation to be performed.
+ * handle: The handle of an existing object, or, for DSM_OP_CREATE, the
+ * new handle the caller wants created.
+ * request_size: For DSM_OP_CREATE, the requested size. For DSM_OP_RESIZE,
+ * the new size. Otherwise, 0.
+ * impl_private: Private, implementation-specific data. Will be a pointer
+ * to NULL for the first operation on a shared memory segment within this
+ * backend; thereafter, it will point to the value to which it was set
+ * on the previous call.
+ * mapped_address: Pointer to start of current mapping; pointer to NULL
+ * if none. Updated with new mapping address.
+ * mapped_size: Pointer to size of current mapping; pointer to 0 if none.
+ * Updated with new mapped size.
+ * elevel: Level at which to log errors.
+ *
+ * Return value: true on success, false on failure. When false is returned,
+ * a message should first be logged at the specified elevel, except in the
+ * case where DSM_OP_CREATE experiences a name collision, which should
+ * silently return false.
+ *------
+ */
+bool
+dsm_impl_op(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address, uint64 *mapped_size,
+ int elevel)
+{
+ Assert(op == DSM_OP_CREATE || op == DSM_OP_RESIZE || request_size == 0);
+ Assert((op != DSM_OP_CREATE && op != DSM_OP_ATTACH) ||
+ (*mapped_address == NULL && *mapped_size == 0));
+
+ if (request_size > (size_t) -1)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("requested shared memory size overflows size_t")));
+
+ switch (dynamic_shared_memory_type)
+ {
+#ifdef USE_DSM_POSIX
+ case DSM_IMPL_POSIX:
+ return dsm_impl_posix(op, handle, request_size, impl_private,
+ mapped_address, mapped_size, elevel);
+#endif
+#ifdef USE_DSM_SYSV
+ case DSM_IMPL_SYSV:
+ return dsm_impl_sysv(op, handle, request_size, impl_private,
+ mapped_address, mapped_size, elevel);
+#endif
+#ifdef USE_DSM_WINDOWS
+ case DSM_IMPL_WINDOWS:
+ return dsm_impl_windows(op, handle, request_size, impl_private,
+ mapped_address, mapped_size, elevel);
+#endif
+#ifdef USE_DSM_MMAP
+ case DSM_IMPL_MMAP:
+ return dsm_impl_mmap(op, handle, request_size, impl_private,
+ mapped_address, mapped_size, elevel);
+#endif
+ }
+ elog(ERROR, "unexpected dynamic shared memory type: %d",
+ dynamic_shared_memory_type);
+ return false; /* not reached, but keeps compilers happy */
+}
+
+/*
+ * Does the current dynamic shared memory implementation support resizing
+ * segments? (The answer here could be platform-dependent in the future,
+ * since AIX allows shmctl(shmid, SHM_RESIZE, &buffer), though you apparently
+ * can't resize segments to anything larger than 256MB that way. For now,
+ * we keep it simple.)
+ */
+bool
+dsm_impl_can_resize(void)
+{
+ switch (dynamic_shared_memory_type)
+ {
+ case DSM_IMPL_NONE:
+ return false;
+ case DSM_IMPL_POSIX:
+ return true;
+ case DSM_IMPL_SYSV:
+ return false;
+ case DSM_IMPL_WINDOWS:
+ return false;
+ case DSM_IMPL_MMAP:
+ return false;
+ default:
+ return false; /* should not happen */
+ }
+}
+
+#ifdef USE_DSM_POSIX
+/*
+ * Operating system primitives to support POSIX shared memory.
+ *
+ * POSIX shared memory segments are created and attached using shm_open()
+ * and shm_unlink(); other operations, such as sizing or mapping the
+ * segment, are performed as if the shared memory segments were files.
+ *
+ * Indeed, on some platforms, they may be implemented that way. While
+ * POSIX shared memory segments seem intended to exist in a flat namespace,
+ * some operating systems may implement them as files, even going so far
+ * to treat a request for /xyz as a request to create a file by that name
+ * in the root directory. Users of such broken platforms should select
+ * a different shared memory implementation.
+ */
+static bool
+dsm_impl_posix(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address, uint64 *mapped_size,
+ int elevel)
+{
+ char name[64];
+ int flags;
+ int fd;
+ char *address;
+
+ snprintf(name, 64, "/PostgreSQL.%lu", (unsigned long) handle);
+
+ /* Handle teardown cases. */
+ if (op == DSM_OP_DETACH || op == DSM_OP_DESTROY)
+ {
+ if (*mapped_address != NULL
+ && munmap(*mapped_address, *mapped_size) != 0)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not unmap shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = NULL;
+ *mapped_size = 0;
+ if (op == DSM_OP_DESTROY && shm_unlink(name) != 0)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not remove shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ return true;
+ }
+
+ /*
+ * Create new segment or open an existing one for attach or resize.
+ *
+ * Even though we're not going through fd.c, we should be safe against
+ * running out of file descriptors, because of NUM_RESERVED_FDS. We're
+ * only opening one extra descriptor here, and we'll close it before
+ * returning.
+ */
+ flags = O_RDWR | (op == DSM_OP_CREATE ? O_CREAT | O_EXCL : 0);
+ if ((fd = shm_open(name, flags, 0600)) == -1)
+ {
+ if (errno != EEXIST)
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not open shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+
+ /*
+ * If we're attaching the segment, determine the current size; if we are
+ * creating or resizing the segment, set the size to the requested value.
+ */
+ if (op == DSM_OP_ATTACH)
+ {
+ struct stat st;
+
+ if (fstat(fd, &st) != 0)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ close(fd);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not stat shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ request_size = st.st_size;
+ }
+ else if (*mapped_size != request_size && ftruncate(fd, request_size))
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ close(fd);
+ if (op == DSM_OP_CREATE)
+ shm_unlink(name);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not resize shared memory segment %s to " UINT64_FORMAT " bytes: %m",
+ name, request_size)));
+ return false;
+ }
+
+ /*
+ * If we're reattaching or resizing, we must remove any existing mapping,
+ * unless we've already got the right thing mapped.
+ */
+ if (*mapped_address != NULL)
+ {
+ if (*mapped_size == request_size)
+ return true;
+ if (munmap(*mapped_address, *mapped_size) != 0)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ close(fd);
+ if (op == DSM_OP_CREATE)
+ shm_unlink(name);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not unmap shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = NULL;
+ *mapped_size = 0;
+ }
+
+ /* Map it. */
+ address = mmap(NULL, request_size, PROT_READ|PROT_WRITE,
+ MAP_SHARED|MAP_HASSEMAPHORE, fd, 0);
+ if (address == MAP_FAILED)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ close(fd);
+ if (op == DSM_OP_CREATE)
+ shm_unlink(name);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not map shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = address;
+ *mapped_size = request_size;
+ close(fd);
+
+ return true;
+}
+#endif
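The create-then-map sequence used by dsm_impl_posix() above (size the backing object with ftruncate(), mmap() it shared, then close the descriptor, which the mapping survives) can be sketched standalone. This sketch substitutes a plain file for shm_open() so it stays portable, closer to the mmap implementation later in the patch; the function name and error handling are illustrative, not the patch's API:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a backing object of the requested size and map it shared.
 * Returns the mapped address, or NULL on failure; the file descriptor is
 * closed either way, as in dsm_impl_posix(), since the mapping remains
 * valid after close().
 */
static void *create_and_map(const char *path, size_t request_size)
{
	void *address = NULL;
	int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

	if (fd == -1)
		return NULL;
	if (ftruncate(fd, (off_t) request_size) == 0)
	{
		address = mmap(NULL, request_size, PROT_READ | PROT_WRITE,
					   MAP_SHARED, fd, 0);
		if (address == MAP_FAILED)
			address = NULL;
	}
	close(fd);	/* the mapping survives the close */
	return address;
}
```

A second process (or, here, a second mapping of the same file) sees stores made through the first mapping, which is the property the segment implementations rely on.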
+
+#ifdef USE_DSM_SYSV
+/*
+ * Operating system primitives to support System V shared memory.
+ *
+ * System V shared memory segments are manipulated using shmget(), shmat(),
+ * shmdt(), and shmctl(). There's no portable way to resize such
+ * segments. As the default allocation limits for System V shared memory
+ * are usually quite low, the POSIX facilities may be preferable; but
+ * those are not supported everywhere.
+ */
+static bool
+dsm_impl_sysv(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address, uint64 *mapped_size,
+ int elevel)
+{
+ key_t key;
+ int ident;
+ char *address;
+ char name[64];
+ int *ident_cache;
+
+ /* Resize is not supported for System V shared memory. */
+ if (op == DSM_OP_RESIZE)
+ {
+ elog(elevel, "System V shared memory segments cannot be resized");
+ return false;
+ }
+
+ /* Since resize isn't supported, reattach is a no-op. */
+ if (op == DSM_OP_ATTACH && *mapped_address != NULL)
+ return true;
+
+ /*
+ * POSIX shared memory and mmap-based shared memory identify segments
+ * with names. To avoid needless error message variation, we use the
+ * handle as the name.
+ */
+ snprintf(name, 64, "%lu", (unsigned long) handle);
+
+ /*
+ * The System V shared memory namespace is very restricted; names are
+ * of type key_t, which is expected to be some sort of integer data type,
+ * but not necessarily the same one as dsm_handle. Since we use
+ * dsm_handle to identify shared memory segments across processes, this
+ * might seem like a problem, but it's really not. If dsm_handle is
+ * bigger than key_t, the cast below might truncate away some bits from
+ * the handle the user-provided, but it'll truncate exactly the same bits
+ * away in exactly the same fashion every time we use that handle, which
+ * is all that really matters. Conversely, if dsm_handle is smaller than
+ * key_t, we won't use the full range of available key space, but that's
+ * no big deal either.
+ *
+ * We do make sure that the key isn't negative, because that might not
+ * be portable.
+ */
+ key = (key_t) handle;
+ if (key < 1) /* avoid compiler warning if type is unsigned */
+ key = -key;
+
+ /*
+ * There's one special key, IPC_PRIVATE, which can't be used. If we end
+ * up with that value by chance during a create operation, just pretend
+ * it already exists, so that caller will retry. If we run into it
+ * anywhere else, the caller has passed a handle that doesn't correspond
+ * to anything we ever created, which should not happen.
+ */
+ if (key == IPC_PRIVATE)
+ {
+ if (op != DSM_OP_CREATE)
+ elog(DEBUG4, "System V shared memory key may not be IPC_PRIVATE");
+ errno = EEXIST;
+ return false;
+ }
+
+ /*
+ * Before we can do anything with a shared memory segment, we have to
+ * map the shared memory key to a shared memory identifier using shmget().
+ * To avoid repeated lookups, we store the key using impl_private.
+ */
+ if (*impl_private != NULL)
+ {
+ ident_cache = *impl_private;
+ ident = *ident_cache;
+ }
+ else
+ {
+ int flags = IPCProtection;
+ size_t segsize;
+
+ /*
+ * Allocate the memory BEFORE acquiring the resource, so that we don't
+ * leak the resource if memory allocation fails.
+ */
+ ident_cache = MemoryContextAlloc(TopMemoryContext, sizeof(int));
+
+ /*
+ * When using shmget to find an existing segment, we must pass the
+ * size as 0. Passing a non-zero size which is greater than the
+ * actual size will result in EINVAL.
+ */
+ segsize = 0;
+
+ if (op == DSM_OP_CREATE)
+ {
+ flags |= IPC_CREAT | IPC_EXCL;
+ segsize = request_size;
+ }
+
+ if ((ident = shmget(key, segsize, flags)) == -1)
+ {
+ if (errno != EEXIST)
+ {
+ int save_errno = errno;
+ pfree(ident_cache);
+ errno = save_errno;
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not get shared memory segment: %m")));
+ }
+ return false;
+ }
+
+ *ident_cache = ident;
+ *impl_private = ident_cache;
+ }
+
+ /* Handle teardown cases. */
+ if (op == DSM_OP_DETACH || op == DSM_OP_DESTROY)
+ {
+ pfree(ident_cache);
+ *impl_private = NULL;
+ if (*mapped_address != NULL && shmdt(*mapped_address) != 0)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not unmap shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = NULL;
+ *mapped_size = 0;
+ if (op == DSM_OP_DESTROY && shmctl(ident, IPC_RMID, NULL) < 0)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not remove shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ return true;
+ }
+
+ /* If we're attaching it, we must use IPC_STAT to determine the size. */
+ if (op == DSM_OP_ATTACH)
+ {
+ struct shmid_ds shm;
+
+ if (shmctl(ident, IPC_STAT, &shm) != 0)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ if (op == DSM_OP_CREATE)
+ shmctl(ident, IPC_RMID, NULL);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not stat shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ request_size = shm.shm_segsz;
+ }
+
+ /* Map it. */
+ address = shmat(ident, NULL, PG_SHMAT_FLAGS);
+ if (address == (void *) -1)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ if (op == DSM_OP_CREATE)
+ shmctl(ident, IPC_RMID, NULL);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not map shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = address;
+ *mapped_size = request_size;
+
+ return true;
+}
+#endif
+
+#ifdef USE_DSM_WINDOWS
+/*
+ * Operating system primitives to support Windows shared memory.
+ *
+ * The Windows shared memory implementation uses a file mapping, which
+ * can be backed either by a physical file or by the system paging file.
+ * We use the system paging file, because the performance implications of
+ * backing the mapping with a physical file are unclear, and because the
+ * main shared memory segment on Windows is handled the same way.
+ *
+ * A memory mapping object is a kernel object - they always get deleted when
+ * the last reference to them goes away, either explicitly via a CloseHandle or
+ * when the process containing the reference exits.
+ */
+static bool
+dsm_impl_windows(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address,
+ uint64 *mapped_size, int elevel)
+{
+ char *address;
+ HANDLE hmap;
+ char name[64];
+ MEMORY_BASIC_INFORMATION info;
+
+ /* Resize is not supported for Windows shared memory. */
+ if (op == DSM_OP_RESIZE)
+ {
+ elog(elevel, "Windows shared memory segments cannot be resized");
+ return false;
+ }
+
+ /* Since resize isn't supported, reattach is a no-op. */
+ if (op == DSM_OP_ATTACH && *mapped_address != NULL)
+ return true;
+
+ /*
+ * Storing the shared memory segment in the Global\ namespace allows any
+ * process running in any session to access the file mapping object,
+ * provided that the caller has the required access rights. However, to
+ * avoid the issues encountered with the main shared memory segment, we
+ * follow a naming convention similar to the one used there. This can be
+ * revisited once the issue mentioned in GetSharedMemName is resolved.
+ */
+ snprintf(name, 64, "Global/PostgreSQL.%lu", (unsigned long) handle);
+
+ /*
+ * Handle teardown cases. Since Windows automatically destroys the object
+ * when no references remain, we can treat destroy the same as detach.
+ */
+ if (op == DSM_OP_DETACH || op == DSM_OP_DESTROY)
+ {
+ if (*mapped_address != NULL
+ && UnmapViewOfFile(*mapped_address) == 0)
+ {
+ _dosmaperr(GetLastError());
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not unmap shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ if (*impl_private != NULL
+ && CloseHandle(*impl_private) == 0)
+ {
+ _dosmaperr(GetLastError());
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not remove shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+
+ *impl_private = NULL;
+ *mapped_address = NULL;
+ *mapped_size = 0;
+ return true;
+ }
+
+ /* Create new segment or open an existing one for attach. */
+ if (op == DSM_OP_CREATE)
+ {
+ DWORD size_high = (DWORD) (request_size >> 32);
+ DWORD size_low = (DWORD) request_size;
+ hmap = CreateFileMapping(INVALID_HANDLE_VALUE, /* Use the pagefile */
+ NULL, /* Default security attrs */
+ PAGE_READWRITE, /* Memory is read/write */
+ size_high, /* Upper 32 bits of size */
+ size_low, /* Lower 32 bits of size */
+ name);
+ _dosmaperr(GetLastError());
+ if (errno == EEXIST)
+ {
+ /*
+ * On Windows, when the segment already exists, a handle for the
+ * existing segment is returned. We must close it before
+ * returning. We don't do _dosmaperr here, so errno won't be
+ * modified.
+ */
+ CloseHandle(hmap);
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not open shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ }
+ else
+ {
+ hmap = OpenFileMapping(FILE_MAP_WRITE | FILE_MAP_READ,
+ FALSE, /* do not inherit the name */
+ name); /* name of mapping object */
+ _dosmaperr(GetLastError());
+ }
+
+ if (!hmap)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not open shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+
+ /* Map it. */
+ address = MapViewOfFile(hmap, FILE_MAP_WRITE | FILE_MAP_READ,
+ 0, 0, 0);
+ if (!address)
+ {
+ int save_errno;
+
+ _dosmaperr(GetLastError());
+ /* Back out what's already been done. */
+ save_errno = errno;
+ CloseHandle(hmap);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not map shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+
+ /*
+ * VirtualQuery gives size in page_size units, which is 4K for Windows.
+ * We need size only when we are attaching, but it's better to get the
+ * size when creating new segment to keep size consistent both for
+ * DSM_OP_CREATE and DSM_OP_ATTACH.
+ */
+ if (VirtualQuery(address, &info, sizeof(info)) == 0)
+ {
+ int save_errno;
+
+ _dosmaperr(GetLastError());
+ /* Back out what's already been done. */
+ save_errno = errno;
+ UnmapViewOfFile(address);
+ CloseHandle(hmap);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not stat shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+
+ *mapped_address = address;
+ *mapped_size = info.RegionSize;
+ *impl_private = hmap;
+
+ return true;
+}
+#endif
+
+#ifdef USE_DSM_MMAP
+/*
+ * Operating system primitives to support mmap-based shared memory.
+ *
+ * Calling this "shared memory" is somewhat of a misnomer, because what
+ * we're really doing is creating a bunch of files and mapping them into
+ * our address space. The operating system may feel obliged to
+ * synchronize the contents to disk even if nothing is being paged out,
+ * which will not serve us well. The user can relocate the pg_dynshmem
+ * directory to a ramdisk to avoid this problem, if available.
+ */
+static bool
+dsm_impl_mmap(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address, uint64 *mapped_size,
+ int elevel)
+{
+ char name[64];
+ int flags;
+ int fd;
+ char *address;
+
+ snprintf(name, 64, PG_DYNSHMEM_DIR "/" PG_DYNSHMEM_MMAP_FILE_PREFIX "%lu",
+ (unsigned long) handle);
+
+ /* Handle teardown cases. */
+ if (op == DSM_OP_DETACH || op == DSM_OP_DESTROY)
+ {
+ if (*mapped_address != NULL
+ && munmap(*mapped_address, *mapped_size) != 0)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not unmap shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = NULL;
+ *mapped_size = 0;
+ if (op == DSM_OP_DESTROY && unlink(name) != 0)
+ {
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not remove shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ return true;
+ }
+
+ /* Create new segment or open an existing one for attach or resize. */
+ flags = O_RDWR | (op == DSM_OP_CREATE ? O_CREAT | O_EXCL : 0);
+ if ((fd = OpenTransientFile(name, flags, 0600)) == -1)
+ {
+ if (errno != EEXIST)
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not open shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+
+ /*
+ * If we're attaching the segment, determine the current size; if we are
+ * creating or resizing the segment, set the size to the requested value.
+ */
+ if (op == DSM_OP_ATTACH)
+ {
+ struct stat st;
+
+ if (fstat(fd, &st) != 0)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ CloseTransientFile(fd);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not stat shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ request_size = st.st_size;
+ }
+ else if (*mapped_size > request_size && ftruncate(fd, request_size))
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ CloseTransientFile(fd);
+ if (op == DSM_OP_CREATE)
+ unlink(name);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not resize shared memory segment %s to " UINT64_FORMAT " bytes: %m",
+ name, request_size)));
+ return false;
+ }
+ else if (*mapped_size < request_size)
+ {
+ /*
+ * Allocate a buffer full of zeros.
+ *
+ * Note: palloc zbuffer, instead of just using a local char array,
+ * to ensure it is reasonably well-aligned; this may save a few
+ * cycles transferring data to the kernel.
+ */
+ char *zbuffer = (char *) palloc0(ZBUFFER_SIZE);
+ uint64 remaining = request_size;
+ bool success = true;
+
+ /*
+ * Zero-fill the file. We have to do this the hard way to ensure
+ * that all the file space has really been allocated, so that we
+ * don't later seg fault when accessing the memory mapping. This
+ * is pretty pessimal.
+ */
+ while (success && remaining > 0)
+ {
+ uint64 goal = remaining;
+
+ if (goal > ZBUFFER_SIZE)
+ goal = ZBUFFER_SIZE;
+ if (write(fd, zbuffer, goal) == goal)
+ remaining -= goal;
+ else
+ success = false;
+ }
+
+ if (!success)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ CloseTransientFile(fd);
+ if (op == DSM_OP_CREATE)
+ unlink(name);
+ errno = save_errno ? save_errno : ENOSPC;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not resize shared memory segment %s to " UINT64_FORMAT " bytes: %m",
+ name, request_size)));
+ return false;
+ }
+ }
+
+ /*
+ * If we're reattaching or resizing, we must remove any existing mapping,
+ * unless we've already got the right thing mapped.
+ */
+ if (*mapped_address != NULL)
+ {
+ if (*mapped_size == request_size)
+ return true;
+ if (munmap(*mapped_address, *mapped_size) != 0)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ CloseTransientFile(fd);
+ if (op == DSM_OP_CREATE)
+ unlink(name);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not unmap shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = NULL;
+ *mapped_size = 0;
+ }
+
+ /* Map it. */
+ address = mmap(NULL, request_size, PROT_READ|PROT_WRITE,
+ MAP_SHARED|MAP_HASSEMAPHORE, fd, 0);
+ if (address == MAP_FAILED)
+ {
+ int save_errno;
+
+ /* Back out what's already been done. */
+ save_errno = errno;
+ CloseTransientFile(fd);
+ if (op == DSM_OP_CREATE)
+ unlink(name);
+ errno = save_errno;
+
+ ereport(elevel,
+ (errcode_for_dynamic_shared_memory(),
+ errmsg("could not map shared memory segment \"%s\": %m",
+ name)));
+ return false;
+ }
+ *mapped_address = address;
+ *mapped_size = request_size;
+ CloseTransientFile(fd);
+
+ return true;
+}
+#endif
+
+static int
+errcode_for_dynamic_shared_memory(void)
+{
+ if (errno == EFBIG || errno == ENOMEM)
+ return errcode(ERRCODE_OUT_OF_MEMORY);
+ else
+ return errcode_for_file_access();
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a0b741b..040c7aa 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -30,6 +30,7 @@
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
+#include "storage/dsm.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
@@ -249,6 +250,10 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
ShmemBackendArrayAllocation();
#endif
+ /* Initialize dynamic shared memory facilities. */
+ if (!IsUnderPostmaster)
+ dsm_postmaster_startup();
+
/*
* Now give loadable modules a chance to set up their shmem allocations
*/
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3107f9c..44a0e75 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -61,6 +61,7 @@
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
+#include "storage/dsm_impl.h"
#include "storage/standby.h"
#include "storage/fd.h"
#include "storage/proc.h"
@@ -385,6 +386,7 @@ static const struct config_enum_entry synchronous_commit_options[] = {
*/
extern const struct config_enum_entry wal_level_options[];
extern const struct config_enum_entry sync_method_options[];
+extern const struct config_enum_entry dynamic_shared_memory_options[];
/*
* GUC option variables that are exported from this module
@@ -3336,6 +3338,16 @@ static struct config_enum ConfigureNamesEnum[] =
},
{
+ {"dynamic_shared_memory_type", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Selects the dynamic shared memory implementation used."),
+ NULL
+ },
+ &dynamic_shared_memory_type,
+ DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE, dynamic_shared_memory_options,
+ NULL, NULL, NULL
+ },
+
+ {
{"wal_sync_method", PGC_SIGHUP, WAL_SETTINGS,
gettext_noop("Selects the method used for forcing WAL updates to disk."),
NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d69a02b..c9cea28 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -123,6 +123,13 @@
#work_mem = 1MB # min 64kB
#maintenance_work_mem = 16MB # min 1MB
#max_stack_depth = 2MB # min 100kB
+#dynamic_shared_memory_type = posix # the default is the first option
+ # supported by the operating system:
+ # posix
+ # sysv
+ # windows
+ # mmap
+ # use none to disable dynamic shared memory
# - Disk -
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index e7ec393..43542cf 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -98,6 +98,11 @@ typedef struct ResourceOwnerData
int nfiles; /* number of owned temporary files */
File *files; /* dynamically allocated array */
int maxfiles; /* currently allocated array size */
+
+ /* We have built-in support for remembering dynamic shmem segments */
+ int ndsms; /* number of owned shmem segments */
+ dsm_segment **dsms; /* dynamically allocated array */
+ int maxdsms; /* currently allocated array size */
} ResourceOwnerData;
@@ -132,6 +137,7 @@ static void PrintPlanCacheLeakWarning(CachedPlan *plan);
static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
+static void PrintDSMLeakWarning(dsm_segment *seg);
/*****************************************************************************
@@ -271,6 +277,21 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
PrintRelCacheLeakWarning(owner->relrefs[owner->nrelrefs - 1]);
RelationClose(owner->relrefs[owner->nrelrefs - 1]);
}
+
+ /*
+ * Release dynamic shared memory segments. Note that dsm_detach()
+ * will remove the segment from my list, so I just have to iterate
+ * until there are none.
+ *
+ * As in the preceding cases, warn if there are leftovers at commit
+ * time.
+ */
+ while (owner->ndsms > 0)
+ {
+ if (isCommit)
+ PrintDSMLeakWarning(owner->dsms[owner->ndsms - 1]);
+ dsm_detach(owner->dsms[owner->ndsms - 1]);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -438,6 +459,8 @@ ResourceOwnerDelete(ResourceOwner owner)
pfree(owner->snapshots);
if (owner->files)
pfree(owner->files);
+ if (owner->dsms)
+ pfree(owner->dsms);
pfree(owner);
}
@@ -1230,3 +1253,88 @@ PrintFileLeakWarning(File file)
"temporary file leak: File %d still referenced",
file);
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * dynamic shmem segment reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeDSMs(ResourceOwner owner)
+{
+ int newmax;
+
+ if (owner->ndsms < owner->maxdsms)
+ return; /* nothing to do */
+
+ if (owner->dsms == NULL)
+ {
+ newmax = 16;
+ owner->dsms = (dsm_segment **)
+ MemoryContextAlloc(TopMemoryContext,
+ newmax * sizeof(dsm_segment *));
+ owner->maxdsms = newmax;
+ }
+ else
+ {
+ newmax = owner->maxdsms * 2;
+ owner->dsms = (dsm_segment **)
+ repalloc(owner->dsms, newmax * sizeof(dsm_segment *));
+ owner->maxdsms = newmax;
+ }
+}
+
+/*
+ * Remember that a dynamic shmem segment is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeDSMs()
+ */
+void
+ResourceOwnerRememberDSM(ResourceOwner owner, dsm_segment *seg)
+{
+ Assert(owner->ndsms < owner->maxdsms);
+ owner->dsms[owner->ndsms] = seg;
+ owner->ndsms++;
+}
+
+/*
+ * Forget that a dynamic shmem segment is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetDSM(ResourceOwner owner, dsm_segment *seg)
+{
+ dsm_segment **dsms = owner->dsms;
+ int ns1 = owner->ndsms - 1;
+ int i;
+
+ for (i = ns1; i >= 0; i--)
+ {
+ if (dsms[i] == seg)
+ {
+ while (i < ns1)
+ {
+ dsms[i] = dsms[i + 1];
+ i++;
+ }
+ owner->ndsms = ns1;
+ return;
+ }
+ }
+ elog(ERROR,
+ "dynamic shared memory segment %lu is not owned by resource owner %s",
+ (unsigned long) dsm_segment_handle(seg), owner->name);
+}
+
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintDSMLeakWarning(dsm_segment *seg)
+{
+ elog(WARNING,
+ "dynamic shared memory leak: segment %lu still referenced",
+ (unsigned long) dsm_segment_handle(seg));
+}
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index f66f530..a6eb0d8 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -182,6 +182,7 @@ const char *subdirs[] = {
"pg_xlog",
"pg_xlog/archive_status",
"pg_clog",
+ "pg_dynshmem",
"pg_notify",
"pg_serial",
"pg_snapshots",
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 8aabf3c..5eac52d 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -424,6 +424,9 @@
/* Define to 1 if you have the `setsid' function. */
#undef HAVE_SETSID
+/* Define to 1 if you have the `shm_open' function. */
+#undef HAVE_SHM_OPEN
+
/* Define to 1 if you have the `sigprocmask' function. */
#undef HAVE_SIGPROCMASK
diff --git a/src/include/portability/mem.h b/src/include/portability/mem.h
new file mode 100644
index 0000000..2a07c10
--- /dev/null
+++ b/src/include/portability/mem.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * mem.h
+ * portability definitions for various memory operations
+ *
+ * Copyright (c) 2001-2013, PostgreSQL Global Development Group
+ *
+ * src/include/portability/mem.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef MEM_H
+#define MEM_H
+
+#define IPCProtection (0600) /* access/modify by user only */
+
+#ifdef SHM_SHARE_MMU /* use intimate shared memory on Solaris */
+#define PG_SHMAT_FLAGS SHM_SHARE_MMU
+#else
+#define PG_SHMAT_FLAGS 0
+#endif
+
+/* Linux prefers MAP_ANONYMOUS, but the flag is called MAP_ANON on other systems. */
+#ifndef MAP_ANONYMOUS
+#define MAP_ANONYMOUS MAP_ANON
+#endif
+
+/* BSD-derived systems have MAP_HASSEMAPHORE, but it's not present (or needed) on Linux. */
+#ifndef MAP_HASSEMAPHORE
+#define MAP_HASSEMAPHORE 0
+#endif
+
+#define PG_MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+
+/* Some really old systems don't define MAP_FAILED. */
+#ifndef MAP_FAILED
+#define MAP_FAILED ((void *) -1)
+#endif
+
+#endif /* MEM_H */
diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h
new file mode 100644
index 0000000..2b5e722
--- /dev/null
+++ b/src/include/storage/dsm.h
@@ -0,0 +1,39 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsm.h
+ * manage dynamic shared memory segments
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/dsm.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DSM_H
+#define DSM_H
+
+#include "storage/dsm_impl.h"
+
+typedef struct dsm_segment dsm_segment;
+
+/* Initialization function. */
+extern void dsm_postmaster_startup(void);
+
+/* Functions that create, update, or remove mappings. */
+extern dsm_segment *dsm_create(uint64 size);
+extern dsm_segment *dsm_attach(dsm_handle h);
+extern void *dsm_resize(dsm_segment *seg, uint64 size);
+extern void *dsm_remap(dsm_segment *seg);
+extern void dsm_detach(dsm_segment *seg);
+
+/* Resource management functions. */
+extern void dsm_keep_mapping(dsm_segment *seg);
+extern dsm_segment *dsm_find_mapping(dsm_handle h);
+
+/* Informational functions. */
+extern void *dsm_segment_address(dsm_segment *seg);
+extern uint64 dsm_segment_map_length(dsm_segment *seg);
+extern dsm_handle dsm_segment_handle(dsm_segment *seg);
+
+#endif /* DSM_H */
diff --git a/src/include/storage/dsm_impl.h b/src/include/storage/dsm_impl.h
new file mode 100644
index 0000000..13f1f48
--- /dev/null
+++ b/src/include/storage/dsm_impl.h
@@ -0,0 +1,75 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsm_impl.h
+ * low-level dynamic shared memory primitives
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/dsm_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DSM_IMPL_H
+#define DSM_IMPL_H
+
+/* Dynamic shared memory implementations. */
+#define DSM_IMPL_NONE 0
+#define DSM_IMPL_POSIX 1
+#define DSM_IMPL_SYSV 2
+#define DSM_IMPL_WINDOWS 3
+#define DSM_IMPL_MMAP 4
+
+/*
+ * Determine which dynamic shared memory implementations will be supported
+ * on this platform, and which one will be the default.
+ */
+#ifdef WIN32
+#define USE_DSM_WINDOWS
+#define DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE DSM_IMPL_WINDOWS
+#else
+#ifdef HAVE_SHM_OPEN
+#define USE_DSM_POSIX
+#define DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE DSM_IMPL_POSIX
+#endif
+#define USE_DSM_SYSV
+#ifndef DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE
+#define DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE DSM_IMPL_SYSV
+#endif
+#define USE_DSM_MMAP
+#endif
+
+/* GUC. */
+extern int dynamic_shared_memory_type;
+
+/*
+ * Directory for on-disk state.
+ *
+ * This is used by all implementations for crash recovery and by the mmap
+ * implementation for storage.
+ */
+#define PG_DYNSHMEM_DIR "pg_dynshmem"
+#define PG_DYNSHMEM_MMAP_FILE_PREFIX "mmap."
+
+/* A "name" for a dynamic shared memory segment. */
+typedef uint32 dsm_handle;
+
+/* All the shared-memory operations we know about. */
+typedef enum
+{
+ DSM_OP_CREATE,
+ DSM_OP_ATTACH,
+ DSM_OP_DETACH,
+ DSM_OP_RESIZE,
+ DSM_OP_DESTROY
+} dsm_op;
+
+/* Create, attach to, detach from, resize, or destroy a segment. */
+extern bool dsm_impl_op(dsm_op op, dsm_handle handle, uint64 request_size,
+ void **impl_private, void **mapped_address, uint64 *mapped_size,
+ int elevel);
+
+/* Some implementations cannot resize segments. Can this one? */
+extern bool dsm_impl_can_resize(void);
+
+#endif /* DSM_IMPL_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 39415a3..730c47b 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -80,6 +80,7 @@ typedef enum LWLockId
OldSerXidLock,
SyncRepLock,
BackgroundWorkerLock,
+ DynamicSharedMemoryControlLock,
/* Individual lock IDs end here */
FirstBufMappingLock,
FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a5d8707..6693483 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -16,6 +16,7 @@
#ifndef RESOWNER_PRIVATE_H
#define RESOWNER_PRIVATE_H
+#include "storage/dsm.h"
#include "storage/fd.h"
#include "storage/lock.h"
#include "utils/catcache.h"
@@ -80,4 +81,11 @@ extern void ResourceOwnerRememberFile(ResourceOwner owner,
extern void ResourceOwnerForgetFile(ResourceOwner owner,
File file);
+/* support for dynamic shared memory management */
+extern void ResourceOwnerEnlargeDSMs(ResourceOwner owner);
+extern void ResourceOwnerRememberDSM(ResourceOwner owner,
+ dsm_segment *);
+extern void ResourceOwnerForgetDSM(ResourceOwner owner,
+ dsm_segment *);
+
#endif /* RESOWNER_PRIVATE_H */
Hi,
On 2013-09-19 11:44:34 -0400, Robert Haas wrote:
On Wed, Sep 18, 2013 at 1:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
--- /dev/null
+++ b/src/backend/storage/ipc/dsm.c
+#define PG_DYNSHMEM_STATE_FILE PG_DYNSHMEM_DIR "/state"
+#define PG_DYNSHMEM_NEW_STATE_FILE PG_DYNSHMEM_DIR "/state.new"

Hm, I guess you don't want to add it to global/ or so because of the
mmap implementation where you presumably scan the directory?

Yes, and also because I thought this way would make it easier to teach
things like pg_basebackup (or anybody's home-brew scripts) to just
skip that directory completely. Actually, I was wondering if we ought
to have a directory under pgdata whose explicit charter it was to
contain files that shouldn't be copied as part of a base backup.
pg_do_not_back_this_up.
Wondered exactly about that as soon as you've mentioned
pg_basebackup. pg_local/?
+ /* Determine size for new control segment. */
+ maxitems = PG_DYNSHMEM_FIXED_SLOTS +
+ PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;

It seems likely that MaxConnections would be sufficient?
I think we could argue about the best way to set this until the cows
come home, but I don't think it probably matters much at this point.
We can always change the formula later as we gain experience.
However, I don't have a principled reason for assuming that only
user-connected backends will create dynamic shared memory segments.
Hm, yes. I had MaxBackends down as MaxConnections + autovacuum stuff;
but max_worker_processes are in there now, so you're right that doesn't
make sense.
+/*
+ * Read and parse the state file.
+ *

Perhaps CRC32 the content?
I don't see the point. If the file contents are garbage that happens
to look like a number, we'll go "oh, there isn't any such segment" or
"oh, there is such a segment but it doesn't look like a control
segment, so forget it". There are a lot of things we really ought to
be CRCing to avoid corruption risk, but I can't see how this is
remotely one of them.
I was worried about a partially written file or one containing contents
from two different postmaster cycles, but it's actually far too small for
that...
I initially had thought you'd write the contents of the entire shared
control segment there, not just its id.
+ /* Create or truncate the file. */
+ statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);

Doesn't this need a | PG_BINARY?
It's a text file. Do we need PG_BINARY anyway?
I'd say yes. Non binary mode stuff on windows does stuff like
transforming LF <=> CRLF on reading/writing, which makes sizes not match
up and similar ugliness.
Imo there's little reason to use non-binary mode for anything written
for postgres' own consumption.
Why are you using open() and not
BasicOpenFile or even OpenTransientFile?

Because those don't work in the postmaster.
Oh, that's news to me. Seems strange, especially for BasicOpenFile.
+ /* Write contents. */
+ snprintf(statebuf, PG_DYNSHMEM_STATE_BUFSIZ, "%lu\n",
+ (unsigned long) dsm_control_handle);

Why are we upcasting the length of dsm_control_handle here? Also,
doesn't this need the usual UINT64_FORMAT thingy?

dsm_handle is an alias for uint32. Is that always exactly an unsigned
int or can it sometimes be an unsigned long? I thought the latter, so
couldn't figure out how to write this portably without casting to a
type that explicitly matched the format string.
That should always be an unsigned int on platforms we support. Note that
we've printed TransactionIds that way (i.e. %u) for a long time and they
are a uint32 as well.
Not sure whether it's sensible to only LOG in these cases. After all
there's something unexpected happening. The robustness argument doesn't
count since we're already shutting down.
I see no point in throwing an error. The fact that we're having
trouble cleaning up one dynamic shared memory segment doesn't mean we
shouldn't try to clean up others, or that any remaining postmaster
shutdown hooks shouldn't be executed.
Well, it means we'll do a regular shutdown instead of PANICing
and *not* writing a checkpoint.
If something has corrupted our state to the point we cannot unregister
shared memory we registered, something has gone terribly wrong. Quite
possibly we've scribbled over our control structures or such. In that
case it's not proper to do a normal shutdown, we're quite possibly
writing bad data.
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("dynamic shared memory control segment is not valid")));

Imo that's a PANIC or at the very least a FATAL.
Sure, that's a tempting option, but it doesn't seem to serve any very
necessary point. There's no data corruption problem if we proceed
here. Most likely either (1) there's a bug in the code, which
panicking won't fix or (2) the DBA hand-edited the state file, in
which case maybe he shouldn't have done that, but if he thinks the
best way to recover from that is a cluster-wide restart, he can do
that himself.
"There's no data corruption problem if we proceed" - but there likely
has been one leading to the current state.
+dsm_segment *
+dsm_create(uint64 size)

Do we rely on being run in an environment with proper setup for lwlock
cleanup? I can imagine shared libraries doing this pretty early on...

Yes, we rely on that. I don't really see that as a problem. You'd
better connect to the main shared memory segment before starting to
create your own.
I am not talking about lwlocks itself being setup but an environment
that has resource owners defined and catches errors. I am specifically
asking because you're a) ereport()ing without releasing an LWLock b)
unconditionally relying on the fact that there's a current resource
owner.
In shared_preload_libraries neither is the case afair?
Now, you could very well argue that you don't need to use dsm for
shared_preload_libraries but there are enough libraries that you can use
per session or globally. Requiring them to use both implementation or
register stuff later seems like it would complicate things.
+void *
+dsm_resize(dsm_segment *seg, uint64 size)
+{
+ Assert(seg->control_slot != INVALID_CONTROL_SLOT);
+ dsm_impl_op(DSM_OP_RESIZE, seg->handle, size, &seg->impl_private,
+ &seg->mapped_address, &seg->mapped_size, ERROR);
+ return seg->mapped_address;
+}

Hm. That's valid when there are other backends attached? What are the
implications for already attached ones?

They'll continue to see the portion they have mapped, but must do
dsm_remap() if they want to see the whole thing.
But resizing can shrink, can it not? And we do an ftruncate() in at
least the posix shmem case. Which means the other backend will get a
SIGSEGV accessing that memory IIRC.
Shouldn't we error out if !dsm_impl_can_resize()?
The implementation-specific code throws an error if it can't support
resize. Even if we put a secondary check here, I wouldn't want
dsm_impl_op to behave in an undefined manner when asked to resize
under an implementation that can't. And there doesn't seem to be much
point in having two checks.
Well, you have the check in dsm_remap(), which seems strange to me.
+void
+dsm_detach(dsm_segment *seg)
+{

Why do we want to ignore errors like failing to unmap? ISTM that
indicates an actual problem...

Sure it does. But what are you going to do about it? In many cases,
you're going to get here during a transaction abort caused by some
other error. If the transaction is already aborting, throwing an
error here will just cause the original error to get discarded in
favor of showing this one, or maybe it's the other way around. I
don't remember, but it's definitely one or the other, and neither is
desirable. Throwing a warning, on the other hand, will notify the
user, which is what we want.

Now on the flip side we might not be aborting; maybe we're committing.
But we don't want to turn a commit into an abort just for this. If
resowner.c detects a buffer pin leak or a tuple descriptor leak, those
are "just" warning as well. They're serious warnings, of course, and
if they happen it means there's a bug in the code that needs to be
fixed. But the severity of an ereport() isn't based just on how
alarming the situation is; it's based on what you want to happen when
that situation comes up. And we've decided (correctly, I think) that
resource leaks are not grounds for aborting a transaction that
otherwise would have committed.
We're not talking about a missed munmap() but about one that failed. If
we unpin the leaked pins and notice that we haven't actually pinned it
anymore we do error (well, Assert) out. Same for TupleDescs.
If there were valid scenarios in which you could get into that
situation, maybe. But which would that be? ISTM we can only get there if
our internal state is messed up.
+ * several implementations of this facility. Many systems implement
+ * POSIX shared memory (shm_open etc.), which is well-suited to our needs
+ * in this area, with the exception that shared memory identifiers live
+ * in a flat system-wide namespace, raising the uncomfortable prospect of
+ * name collisions with other processes (including other copies of
+ * PostgreSQL) running on the same system.

Why isn't the port number part of the posix shmem identifiers? Sure, we
retry, but using logic similar to sysv_shmem.c seems like a good idea.

According to the man page for shm_open on Solaris, "For maximum
portability, name should include no more than 14 characters, but this
limit is not enforced."
What about "/pgsql.%u" or something similar? That should still fit.
+ /*
+ * If we're reattaching or resizing, we must remove any existing mapping,
+ * unless we've already got the right thing mapped.
+ */
+ if (*mapped_address != NULL)
+ {
+ if (*mapped_size == request_size)
+ return true;

Hm. It could have gotten resized to the old size, or resized twice. In
that case it might not be at the same address as before, so checking
the size doesn't seem to be sufficient.

I don't understand your concern. If someone resizes the DSM to its
already-current size, there is no need to remap it. The old mapping
is just fine. And if some other backend resizes the DSM to a larger
size and then back to the original size, and then we're asked to
update the mapping, there is no need to change anything.
Yes, forget what I said. I was confusing myself.
+static int
+errcode_for_dynamic_shared_memory()
+{
+ if (errno == EFBIG || errno == ENOMEM)
+ return errcode(ERRCODE_OUT_OF_MEMORY);
+ else
+ return errcode_for_file_access();
+}

Is EFBIG guaranteed to be defined?
I dunno. We could put an #ifdef around that part. Should we do that
now or wait and see if it actually breaks anywhere?
A bit of googling around seems to indicate it's likely to be
available. Even on windows according to MSDN.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 20, 2013 at 5:14 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Hi,
On 2013-09-19 11:44:34 -0400, Robert Haas wrote:
On Wed, Sep 18, 2013 at 1:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
+ /* Create or truncate the file. */
+ statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);

Doesn't this need a | PG_BINARY?
It's a text file. Do we need PG_BINARY anyway?
I'd say yes. Non binary mode stuff on windows does stuff like
transforming LF <=> CRLF on reading/writing, which makes sizes not match
up and similar ugliness.
Imo there's little reason to use non-binary mode for anything written
for postgres' own consumption.
On checking this in the code, I found the comment below, which
suggests LF <=> CRLF is not an issue (on Windows it uses pgwin32_open
to open a file):
/*
* NOTE: this is also used for opening text files.
* WIN32 treats Control-Z as EOF in files opened in text mode.
* Therefore, we open files in binary mode on Win32 so we can read
* literal control-Z. The other affect is that we see CRLF, but
* that is OK because we can already handle those cleanly.
*/
As a second instance, I noticed the code below, which again suggests
CRLF should not be an issue unless the file mode is explicitly set to
TEXT mode, which is not the case with the current usage of open() in
the dynamic shared memory code.
#ifdef WIN32
/* use CRLF line endings on Windows */
_setmode(_fileno(fh), _O_TEXT);
#endif
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Sep 22, 2013 at 01:17:52PM +0530, Amit Kapila wrote:
On Fri, Sep 20, 2013 at 5:14 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-09-19 11:44:34 -0400, Robert Haas wrote:
On Wed, Sep 18, 2013 at 1:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
+ /* Create or truncate the file. */
+ statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);

Doesn't this need a | PG_BINARY?
It's a text file. Do we need PG_BINARY anyway?
I'd say yes. Non binary mode stuff on windows does stuff like
transforming LF <=> CRLF on reading/writing, which makes sizes not match
up and similar ugliness.
Imo there's little reason to use non-binary mode for anything written
for postgres' own consumption.

On checking about this in code, I found the below comment which
suggests LF<=> CRLF is not an issue (in windows it uses pgwin32_open
to open a file):

/*
* NOTE: this is also used for opening text files.
* WIN32 treats Control-Z as EOF in files opened in text mode.
* Therefore, we open files in binary mode on Win32 so we can read
* literal control-Z. The other affect is that we see CRLF, but
* that is OK because we can already handle those cleanly.
*/
That comment appears at the definition of PG_BINARY. You only get what it
describes when you use PG_BINARY.
Second instance, I noticed in code as below which again suggests CRLF
should not be an issue until file mode is specifically set to TEXT
mode which is not the case with current usage of open in dynamic
shared memory code.

#ifdef WIN32
/* use CRLF line endings on Windows */
_setmode(_fileno(fh), _O_TEXT);
#endif
I suspect that call (in logfile_open()) has no effect. The file is already in
text mode.
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
On Mon, Sep 23, 2013 at 12:34 AM, Noah Misch <noah@leadboat.com> wrote:
On Sun, Sep 22, 2013 at 01:17:52PM +0530, Amit Kapila wrote:
On Fri, Sep 20, 2013 at 5:14 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-09-19 11:44:34 -0400, Robert Haas wrote:
On Wed, Sep 18, 2013 at 1:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
+ /* Create or truncate the file. */
+ statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);

Doesn't this need a | PG_BINARY?
It's a text file. Do we need PG_BINARY anyway?
I'd say yes. Non binary mode stuff on windows does stuff like
transforming LF <=> CRLF on reading/writing, which makes sizes not match
up and similar ugliness.
Imo there's little reason to use non-binary mode for anything written
for postgres' own consumption.

On checking about this in code, I found the below comment which
suggests LF<=> CRLF is not an issue (in windows it uses pgwin32_open
to open a file):

/*
* NOTE: this is also used for opening text files.
* WIN32 treats Control-Z as EOF in files opened in text mode.
* Therefore, we open files in binary mode on Win32 so we can read
* literal control-Z. The other affect is that we see CRLF, but
* that is OK because we can already handle those cleanly.
*/

That comment appears at the definition of PG_BINARY. You only get what it
describes when you use PG_BINARY.

Second instance, I noticed in code as below which again suggests CRLF
should not be an issue until file mode is specifically set to TEXT
mode which is not the case with current usage of open in dynamic
shared memory code.

#ifdef WIN32
/* use CRLF line endings on Windows */
_setmode(_fileno(fh), _O_TEXT);
#endif

I suspect that call (in logfile_open()) has no effect. The file is already in
text mode.
Won't this be required when we have to open a new file due to log
rotation based on time?
The basic point I wanted to make is that unless you use _O_TEXT mode
explicitly, the LF<=>CRLF problem will not happen. The CreateFile() API,
which is used for the Windows implementation of open, doesn't take any
parameter specifying text or binary; only by using _setmode can we set
the file mode to text or binary.
I checked fcntl.h, where the below comment above the definitions of
_O_TEXT and _O_BINARY again points to what I said above.
/* O_TEXT files have <cr><lf> sequences translated to <lf> on read()'s,
** and <lf> sequences translated to <cr><lf> on write()'s
*/
One more point: this issue has a chance of occurring only when somebody
takes the file from a Unix system to Windows and then maybe back. Do you
think the dsm state file should be allowed in a cross-platform backup? I
think pg_basebackup should disallow backing up this file.
However, a user can use some other custom utility to take a
filesystem-level backup where this can happen, but as per my
understanding it should still not create a problem.
I think to be on the safe side we can use PG_BINARY, but it would be
better to use it only if we are sure that this problem can occur.
If you think cross-platform backups can create such issues, then I can
test this as well.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 23, 2013 at 10:43:33AM +0530, Amit Kapila wrote:
On Mon, Sep 23, 2013 at 12:34 AM, Noah Misch <noah@leadboat.com> wrote:
On Sun, Sep 22, 2013 at 01:17:52PM +0530, Amit Kapila wrote:
#ifdef WIN32
/* use CRLF line endings on Windows */
_setmode(_fileno(fh), _O_TEXT);
#endif

I suspect that call (in logfile_open()) has no effect. The file is already in
text mode.

Won't this be required when we have to open a new file due to log
rotation based on time?

The basic point, I wanted to make is that until you use _O_TEXT mode
explicitly, the problem LF<=>CRLF will not happen. CreateFile() API
which is used for windows implementation of open doesn't take any
parameter which specifies it as text or binary, only by using
_setmode, we need to set the file mode as Text or Binary.
You are indeed correct. I had assumed that pgwin32_open() does not change the
usual Windows open()/fopen() behavior concerning line endings. No code
comment mentions otherwise, and that would make pro forma our pervasive use of
PG_BINARY. Nonetheless, it behaves as you say. I wonder if that was
intentional, and I wonder if the outcome varies between Visual Studio versions
(I tested with VS2010).
I checked fcntl.h where there is below comment above definition of
_O_TEXT and _O_BINARY which again is pointing to what I said above.
/* O_TEXT files have <cr><lf> sequences translated to <lf> on read()'s,
** and <lf> sequences translated to <cr><lf> on write()'s
*/
However, O_TEXT is the default in a normal Windows program:
http://msdn.microsoft.com/en-us/library/ktss1a9b.aspx
I think to be on safe side we can use PG_BINARY, but it would be
better if we are sure that this problem can occur then only we should
use it.
If you think cross platform backup's can create such issues, then I
can once test this as well.
I don't know whether writing it as binary will help or hurt that situation.
If nothing else, binary gives you one less variation to think about when
studying the code. Anyone sophisticated enough to meaningfully examine the
file will have no trouble dealing with either line ending convention.
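The practical difference under discussion can be shown with a small portable sketch. On non-Windows platforms PostgreSQL defines PG_BINARY as 0, so adding it costs nothing; on Windows it maps to O_BINARY and suppresses any CRLF translation. This is a standalone illustration (the function name is invented), not the patch's code:

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* PostgreSQL's own convention: a no-op outside Windows. */
#ifdef WIN32
#define PG_BINARY O_BINARY
#else
#define PG_BINARY 0
#endif

/*
 * Write a small text payload in binary mode and report how many bytes
 * land on disk.  With PG_BINARY the size always matches strlen(payload);
 * in text mode on Windows each '\n' would become "\r\n" and inflate it,
 * which is exactly the size mismatch Andres worries about.
 */
static long
write_binary(const char *path, const char *payload)
{
	int			fd;
	long		written;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY, 0600);
	if (fd < 0)
		return -1;
	written = (long) write(fd, payload, strlen(payload));
	close(fd);
	return written;
}
```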
Thanks,
nm
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
On Tue, Sep 24, 2013 at 12:32 AM, Noah Misch <noah@leadboat.com> wrote:
On Mon, Sep 23, 2013 at 10:43:33AM +0530, Amit Kapila wrote:
On Mon, Sep 23, 2013 at 12:34 AM, Noah Misch <noah@leadboat.com> wrote:
On Sun, Sep 22, 2013 at 01:17:52PM +0530, Amit Kapila wrote:
#ifdef WIN32
/* use CRLF line endings on Windows */
_setmode(_fileno(fh), _O_TEXT);
#endif

I suspect that call (in logfile_open()) has no effect. The file is already in
text mode.

Won't this be required when we have to open a new file due to log
rotation based on time?

The basic point, I wanted to make is that until you use _O_TEXT mode
explicitly, the problem LF<=>CRLF will not happen. CreateFile() API
which is used for windows implementation of open doesn't take any
parameter which specifies it as text or binary, only by using
_setmode, we need to set the file mode as Text or Binary.

You are indeed correct. I had assumed that pgwin32_open() does not change the
usual Windows open()/fopen() behavior concerning line endings. No code
comment mentions otherwise, and that would make pro forma our pervasive use of
PG_BINARY.
The only comment giving an indication (or at least where I got the
indication from) is the one I mentioned in the above mail chain, at the
top of PG_BINARY. I understand that it is not very clear from that
comment what the actual handling of the CRLF issue is.
Nonetheless, it behaves as you say. I wonder if that was
intentional, and I wonder if the outcome varies between Visual Studio versions
(I tested with VS2010).
Ideally it should not depend on the VS version, as the outcome depends
on API (CreateFile/_setmode) usage.
I checked fcntl.h where there is below comment above definition of
_O_TEXT and _O_BINARY which again is pointing to what I said above.
/* O_TEXT files have <cr><lf> sequences translated to <lf> on read()'s,
** and <lf> sequences translated to <cr><lf> on write()'s
*/

However, O_TEXT is the default in a normal Windows program:
http://msdn.microsoft.com/en-us/library/ktss1a9b.aspx

I think to be on safe side we can use PG_BINARY, but it would be
better if we are sure that this problem can occur then only we should
use it.
If you think cross platform backup's can create such issues, then I
can once test this as well.

I don't know whether writing it as binary will help or hurt that situation.
If nothing else, binary gives you one less variation to think about when
studying the code.
In that case, shouldn't all other places be consistent? One reason I
had in mind for using the appropriate mode is that somebody reading the
code could tomorrow come up with a question or a patch to use the
correct mode, and then we would again be in the same situation.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Hm, I guess you don't want to add it to global/ or so because of the
mmap implementation where you presumably scan the directory?

Yes, and also because I thought this way would make it easier to teach
things like pg_basebackup (or anybody's home-brew scripts) to just
skip that directory completely. Actually, I was wondering if we ought
to have a directory under pgdata whose explicit charter it was to
contain files that shouldn't be copied as part of a base backup.
pg_do_not_back_this_up.

Wondered exactly about that as soon as you've mentioned
pg_basebackup. pg_local/?
That seems reasonable. It's not totally transparent what that's
supposed to mean, but it's fairly mnemonic once you know. Other
suggestions welcome, if anyone has ideas.
Are there any other likely candidates for inclusion in that directory
other than this stuff?
+ /* Create or truncate the file. */
+ statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);

Doesn't this need a | PG_BINARY?
It's a text file. Do we need PG_BINARY anyway?
I'd say yes. Non binary mode stuff on windows does stuff like
transforming LF <=> CRLF on reading/writing, which makes sizes not match
up and similar ugliness.
Imo there's little reason to use non-binary mode for anything written
for postgres' own consumption.
Well, I'm happy to do whatever the consensus is. AFAICT you and Noah
are both for it and Amit's position is that it doesn't matter either
way, so I'll go ahead and change that unless further discussion sheds
a different light on things.
Why are you using open() and not
BasicOpenFile or even OpenTransientFile?

Because those don't work in the postmaster.
Oh, that's news to me. Seems strange, especially for BasicOpenFile.
Per its header comment, InitFileAccess is not called in the
postmaster, so there's no VFD cache. Thus, any attempt by
BasicOpenFile to call ReleaseLruFile would be pointless at best.
dsm_handle is an alias for uint32. Is that always exactly an unsigned
int or can it sometimes be an unsigned long? I thought the latter, so
couldn't figure out how to write this portably without casting to a
type that explicitly matched the format string.

That should always be an unsigned int on platforms we support. Note that
we've printed TransactionIds that way (i.e. %u) for a long time and they
are a uint32 as well.
Fixed.
Not sure whether it's sensible to only LOG in these cases. After all
there's something unexpected happening. The robustness argument doesn't
count since we're already shutting down.

I see no point in throwing an error. The fact that we're having
trouble cleaning up one dynamic shared memory segment doesn't mean we
shouldn't try to clean up others, or that any remaining postmaster
shutdown hooks shouldn't be executed.

Well, it means we'll do a regular shutdown instead of PANICing
and *not* writing a checkpoint.
If something has corrupted our state to the point we cannot unregister
shared memory we registered, something has gone terribly wrong. Quite
possibly we've scribbled over our control structures or such. In that
case it's not proper to do a normal shutdown, we're quite possibly
writing bad data.
I have to admit I didn't consider the possibility of an
otherwise-clean shutdown that hit only this problem. I'm not sure how
seriously to take that case. I guess we could emit warnings for
individual failures and then throw an error at the end if there were >
0, but that seems a little ugly. Or we could go whole hog and treat
any failure as a critical error. Anyone else have an opinion on what
to do here?
Imo that's a PANIC or at the very least a FATAL.
Sure, that's a tempting option, but it doesn't seem to serve any very
necessary point. There's no data corruption problem if we proceed
here. Most likely either (1) there's a bug in the code, which
panicking won't fix or (2) the DBA hand-edited the state file, in
which case maybe he shouldn't have done that, but if he thinks the
best way to recover from that is a cluster-wide restart, he can do
that himself.

"There's no data corruption problem if we proceed" - but there likely
has been one leading to the current state.
I doubt it. It's more likely that the file permissions got changed or
something.
Do we rely on being run in an environment with proper setup for lwlock
cleanup? I can imagine shared libraries doing this pretty early on...

Yes, we rely on that. I don't really see that as a problem. You'd
better connect to the main shared memory segment before starting to
create your own.

I am not talking about lwlocks itself being setup but an environment
that has resource owners defined and catches errors. I am specifically
asking because you're a) ereport()ing without releasing an LWLock b)
unconditionally relying on the fact that there's a current resource
owner.
In shared_preload_libraries neither is the case afair?

Now, you could very well argue that you don't need to use dsm for
shared_preload_libraries but there are enough libraries that you can use
per session or globally. Requiring them to use both implementation or
register stuff later seems like it would complicate things.
Well, the cleanup logic gets a lot more complicated without a resource
owner. The first few drafts of this logic didn't involve any resource
owner integration and things got quite a bit simpler and nicer when I
added that. For example, in dsm_create(), we do
dsm_create_descriptor() and then loop until we find an unused segment
identifier. If dsm_impl_op() throws an ERROR, which it definitely
can, then the segment produced by dsm_create_descriptor() is still
lying around in dsm_segment_list, and without the resource owner
machinery, that's a permanent leak. Certainly, it's fixable. You can
put PG_TRY() blocks in everywhere and solve the problem that way. But
I'm not very keen on going that route; it looks like it will be
painful and messy.
I also do not think that allocating dynamic shared memory segments in
shared_preload_libraries is actually sensible. You're in the
postmaster at that point, and the main shared memory segment is not
set up. If you were to map a shared memory segment at that point, the
mapping would get inherited in EXEC_BACKEND environments but not
otherwise, so we'd need more infrastructure to handle that. And, of
course, we couldn't use LWLocks for synchronization. I think that we
couldn't use spinlocks either, even if it were otherwise acceptable,
since with --disable-spinlocks those are going to turn into semaphores
that I don't think are available at this point either.
I don't really feel like solving all of those problems and, TBH, I
don't see why it's particularly important. If somebody wants a
loadable module that can be loaded either from
shared_preload_libraries or on the fly, and they use dynamic shared
memory in the latter case, then they can use it in the former case as
well. If they've already got logic to create the DSM when it's first
needed, it doesn't cost extra to do it that way in both cases.
They'll continue to see the portion they have mapped, but must do
dsm_remap() if they want to see the whole thing.

But resizing can shrink, can it not? And we do an ftruncate() in at
least the posix shmem case. Which means the other backend will get a
SIGSEGV accessing that memory IIRC.
Yep. Shrinking the shared memory segment will require special
caution. Caveat emptor.
Shouldn't we error out if !dsm_impl_can_resize()?
The implementation-specific code throws an error if it can't support
resize. Even if we put a secondary check here, I wouldn't want
dsm_impl_op to behave in an undefined manner when asked to resize
under an implementation that can't. And there doesn't seem to be much
point in having two checks.

Well, you have the check in dsm_remap(), which seems strange to me.
Oh, fair point. Removed.
Now on the flip side we might not be aborting; maybe we're committing.
But we don't want to turn a commit into an abort just for this. If
resowner.c detects a buffer pin leak or a tuple descriptor leak, those
are "just" warning as well. They're serious warnings, of course, and
if they happen it means there's a bug in the code that needs to be
fixed. But the severity of an ereport() isn't based just on how
alarming the situation is; it's based on what you want to happen when
that situation comes up. And we've decided (correctly, I think) that
resource leaks are not grounds for aborting a transaction that
otherwise would have committed.

We're not talking about a missed munmap() but about one that failed. If
we unpin the leaked pins and notice that we haven't actually pinned it
anymore we do error (well, Assert) out. Same for TupleDescs.

If there were valid scenarios in which you could get into that
situation, maybe. But which would that be? ISTM we can only get there if
our internal state is messed up.
I don't know. I think that's part of why it's hard to decide what we
want to happen. But personally I think it's paranoid to say, well,
something happened that we weren't expecting, so that must mean
something totally horrible has happened and we'd better die in a fire.
I mean, the fact that the checks you are talking about are assertions
means that they are scenarios we expect never to happen, and therefore
we don't even check for them in a production build. I don't think you
can use that as a precedent to show that any failure here is an
automatic PANIC.
Why isn't the port number part of the posix shmem identifiers? Sure, we
retry, but using a logic similar to sysv_shmem.c seems like a good idea.

According to the man page for shm_open on Solaris, "For maximum
portability, name should include no more than 14 characters, but this
limit is not enforced."

What about "/pgsql.%u" or something similar? That should still fit.
Well, if you want both the port and the identifier in there, that
doesn't get you there.
+static int
+errcode_for_dynamic_shared_memory()
+{
+	if (errno == EFBIG || errno == ENOMEM)
+		return errcode(ERRCODE_OUT_OF_MEMORY);
+	else
+		return errcode_for_file_access();
+}

Is EFBIG guaranteed to be defined?
I dunno. We could put an #ifdef around that part. Should we do that
now or wait and see if it actually breaks anywhere?

A bit of googling around seems to indicate it's likely to be
available. Even on windows according to MSDN.
Cool.
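The #ifdef being discussed would look roughly like this. The errcode constants are stubbed out for illustration; the real function uses ereport machinery from the patch:

```c
#include <errno.h>

/* Stand-ins for the real error-code machinery, for illustration only. */
#define ERRCODE_OUT_OF_MEMORY 1
#define ERRCODE_FILE_ACCESS   2

/*
 * Map errno to an error code, guarding the EFBIG check for any platform
 * that might not define it, as discussed above.
 */
static int
errcode_for_dynamic_shared_memory(int err)
{
	if (err == ENOMEM)
		return ERRCODE_OUT_OF_MEMORY;
#ifdef EFBIG
	if (err == EFBIG)
		return ERRCODE_OUT_OF_MEMORY;
#endif
	return ERRCODE_FILE_ACCESS;
}
```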
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-09-24 12:19:51 -0400, Robert Haas wrote:
On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Hm, I guess you don't want to add it to global/ or so because of the
mmap implementation where you presumably scan the directory?

Yes, and also because I thought this way would make it easier to teach
things like pg_basebackup (or anybody's home-brew scripts) to just
skip that directory completely. Actually, I was wondering if we ought
to have a directory under pgdata whose explicit charter it was to
contain files that shouldn't be copied as part of a base backup.
pg_do_not_back_this_up.

Wondered exactly about that as soon as you've mentioned
pg_basebackup. pg_local/?

That seems reasonable. It's not totally transparent what that's
supposed to mean, but it's fairly mnemonic once you know. Other
suggestions welcome, if anyone has ideas.
pg_node_local/ was the only reasonable thing I could think of otherwise,
and I disliked it because it seems we shouldn't introduce "node" as a
term just for this.
Are there any other likely candidates for inclusion in that directory
other than this stuff?
You could argue that pg_stat_tmp/ is one.
Why are you using open() and not
BasicOpenFile or even OpenTransientFile?

Because those don't work in the postmaster.
Oh, that's news to me. Seems strange, especially for BasicOpenFile.
Per its header comment, InitFileAccess is not called in the
postmaster, so there's no VFD cache. Thus, any attempt by
BasicOpenFile to call ReleaseLruFile would be pointless at best.
Well, but it makes code running in both backends and postmaster easier
to write. Good enough for me anyway.
Imo that's a PANIC or at the very least a FATAL.
Sure, that's a tempting option, but it doesn't seem to serve any very
necessary point. There's no data corruption problem if we proceed
here. Most likely either (1) there's a bug in the code, which
panicking won't fix or (2) the DBA hand-edited the state file, in
which case maybe he shouldn't have done that, but if he thinks the
best way to recover from that is a cluster-wide restart, he can do
that himself.

"There's no data corruption problem if we proceed" - but there likely
has been one leading to the current state.

I doubt it. It's more likely that the file permissions got changed or
something.
We panic in that case during a shutdown, don't we? ... Yep:
PANIC: could not open control file "global/pg_control": Permission denied
I am not talking about lwlocks itself being setup but an environment
that has resource owners defined and catches errors. I am specifically
asking because you're a) ereport()ing without releasing an LWLock b)
unconditionally relying on the fact that there's a current resource
owner.
In shared_preload_libraries neither is the case afair?
I don't really feel like solving all of those problems and, TBH, I
don't see why it's particularly important. If somebody wants a
loadable module that can be loaded either from
shared_preload_libraries or on the fly, and they use dynamic shared
memory in the latter case, then they can use it in the former case as
well. If they've already got logic to create the DSM when it's first
needed, it doesn't cost extra to do it that way in both cases.
Fair enough.
They'll continue to see the portion they have mapped, but must do
dsm_remap() if they want to see the whole thing.

But resizing can shrink, can it not? And we do an ftruncate() in at
least the posix shmem case. Which means the other backend will get a
SIGSEGV accessing that memory IIRC.
Yep. Shrinking the shared memory segment will require special
caution. Caveat emptor.
Then a comment to that effect would be nice.
We're not talking about a missed munmap() but about one that failed. If
we unpin the leaked pins and notice that we haven't actually pinned it
anymore we do error (well, Assert) out. Same for TupleDescs.

If there were valid scenarios in which you could get into that
situation, maybe. But which would that be? ISTM we can only get there if
our internal state is messed up.
I don't know. I think that's part of why it's hard to decide what we
want to happen. But personally I think it's paranoid to say, well,
something happened that we weren't expecting, so that must mean
something totally horrible has happened and we'd better die in a fire.
Well, by that argument we wouldn't need to PANIC on a whole host of
issues. Like segfaults.
Anyway, I guess we need other opinions here.
Why isn't the port number part of the posix shmem identifiers? Sure, we
retry, but using a logic similar to sysv_shmem.c seems like a good idea.

According to the man page for shm_open on Solaris, "For maximum
portability, name should include no more than 14 characters, but this
limit is not enforced."

What about "/pgsql.%u" or something similar? That should still fit.
Well, if you want both the port and the identifier in there, that
doesn't get you there.
Port seems enough to start with - most machines are configured to only
have one cluster starting on one port. That way we wouldn't always get
conflicts but just if somebody does something crazy.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Sep 24, 2013 at 9:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Hm, I guess you don't want to add it to global/ or so because of the
mmap implementation where you presumably scan the directory?

Yes, and also because I thought this way would make it easier to teach
things like pg_basebackup (or anybody's home-brew scripts) to just
skip that directory completely. Actually, I was wondering if we ought
to have a directory under pgdata whose explicit charter it was to
contain files that shouldn't be copied as part of a base backup.
pg_do_not_back_this_up.

Wondered exactly about that as soon as you've mentioned
pg_basebackup. pg_local/?

That seems reasonable. It's not totally transparent what that's
supposed to mean, but it's fairly mnemonic once you know. Other
suggestions welcome, if anyone has ideas.

Are there any other likely candidates for inclusion in that directory
other than this stuff?
pgsql_tmp. Refer to sendDir() in basebackup.c; there we avoid sending
such files in a backup.
Some future features, like ALTER SYSTEM, could also use it for temp files.
+ /* Create or truncate the file. */
+ statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);

Doesn't this need a | PG_BINARY?
It's a text file. Do we need PG_BINARY anyway?
I'd say yes. Non binary mode stuff on windows does stuff like
transforming LF <=> CRLF on reading/writing, which makes sizes not match
up and similar ugliness.
Imo there's little reason to use non-binary mode for anything written
for postgres' own consumption.

Well, I'm happy to do whatever the consensus is. AFAICT you and Noah
are both for it and Amit's position is that it doesn't matter either
way
I am sorry if my mails don't say that I am in favour of keeping the
code as it is unless there is really a case which requires it.
Basically, as per my understanding, I have presented some facts in the
above mails which indicate there is no need for PG_BINARY in this
case.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Sep 24, 2013 at 12:19:51PM -0400, Robert Haas wrote:
On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Actually, I was wondering if we ought
to have a directory under pgdata whose explicit charter it was to
contain files that shouldn't be copied as part of a base backup.
pg_do_not_back_this_up.

Wondered exactly about that as soon as you've mentioned
pg_basebackup. pg_local/?

That seems reasonable. It's not totally transparent what that's
supposed to mean, but it's fairly mnemonic once you know. Other
suggestions welcome, if anyone has ideas.
I like the concept and have no improvements on the name.
Are there any other likely candidates for inclusion in that directory
other than this stuff?
pg_xlog
Not sure whether it's sensible to only LOG in these cases. After all
there's something unexpected happening. The robustness argument doesn't
count since we're already shutting down.

I see no point in throwing an error. The fact that we're having
trouble cleaning up one dynamic shared memory segment doesn't mean we
shouldn't try to clean up others, or that any remaining postmaster
shutdown hooks shouldn't be executed.

Well, it means we'll do a regular shutdown instead of PANICing
and *not* writing a checkpoint.
If something has corrupted our state to the point we cannot unregister
shared memory we registered, something has gone terribly wrong. Quite
possibly we've scribbled over our control structures or such. In that
case it's not proper to do a normal shutdown, we're quite possibly
writing bad data.

I have to admit I didn't consider the possibility of an
otherwise-clean shutdown that hit only this problem. I'm not sure how
seriously to take that case. I guess we could emit warnings for
individual failures and then throw an error at the end if there were >
0, but that seems a little ugly. Or we could go whole hog and treat
any failure as a critical error. Anyone else have an opinion on what
to do here?
There's extensive precedent in our code for LOG, WARNING, or even ignoring the
return value of unlink(). (To my surprise, ignoring the return value is the
most popular choice.) Of the dozens of backend callers, here is the mixed bag
that actually raises ERROR or better:
do_pg_stop_backup()
RestoreArchivedFile()
KeepFileRestoredFromArchive()
create_tablespace_directories() [remove old symlink during recovery]
destroy_tablespace_directories()
RelationCacheInitFilePreInvalidate()
CreateLockFile()
I think it's awfully unlikely that runaway code would corrupt shared_buffers
AND manage to make an unlink() fail.
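To make the LOG-and-continue option concrete, here is a minimal standalone sketch of the behavior under discussion; the function name is invented and plain fprintf stands in for the real ereport() machinery, so treat it only as an illustration of the policy, not as the dsm.c code:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Try to remove every file in the list.  A failure on one file is logged
 * but does not stop cleanup of the remaining files; ENOENT counts as
 * success, since the goal (file gone) is already met.  Returns the number
 * of files that could not be removed.
 */
static int
cleanup_segment_files(const char **paths, int npaths)
{
	int			nfailed = 0;
	int			i;

	for (i = 0; i < npaths; i++)
	{
		if (unlink(paths[i]) != 0 && errno != ENOENT)
		{
			/* LOG-level report; keep going regardless */
			fprintf(stderr, "LOG: could not remove file \"%s\": %s\n",
					paths[i], strerror(errno));
			nfailed++;
		}
	}
	return nfailed;
}
```

The caller can then decide, once at the end, whether a nonzero failure count deserves anything stronger than the per-file LOG lines.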
Imo that's a PANIC or at the very least a FATAL.
Sure, that's a tempting option, but it doesn't seem to serve any very
necessary point. There's no data corruption problem if we proceed
here. Most likely either (1) there's a bug in the code, which
panicking won't fix or (2) the DBA hand-edited the state file, in
which case maybe he shouldn't have done that, but if he thinks the
best way to recover from that is a cluster-wide restart, he can do
that himself.
"There's no data corruption problem if we proceed" - but there likely
has been one leading to the current state.
+1 for making this one a PANIC, though. With startup behind us, a valid dsm
state file pointed us to a control segment with bogus contents. The
conditional probability of shared memory corruption seems higher than that of
a DBA editing the dsm state file of a running cluster to incorrectly name as
the dsm control segment some other existing shared memory segment.
Thanks,
nm
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 9/26/13 8:27 AM, Noah Misch wrote:
On Tue, Sep 24, 2013 at 12:19:51PM -0400, Robert Haas wrote:
On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Actually, I was wondering if we ought
to have a directory under pgdata whose explicit charter it was to
contain files that shouldn't be copied as part of a base backup.
pg_do_not_back_this_up.
Wondered exactly about that as soon as you've mentioned
pg_basebackup. pg_local/?
That seems reasonable. It's not totally transparent what that's
supposed to mean, but it's fairly mnemonic once you know. Other
suggestions welcome, if anyone has ideas.
I like the concept and have no improvements on the name.
Are there any other likely candidates for inclusion in that directory
other than this stuff?
pg_xlog
Isn't it also pointless to backup temp objects as well as non-logged tables?
Or is the purpose of pg_local to be a home for things that MUST NOT be backed up as opposed to items where backing them up is pointless?
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On Thu, Sep 26, 2013 at 2:45 PM, Jim Nasby <jim@nasby.net> wrote:
On 9/26/13 8:27 AM, Noah Misch wrote:
On Tue, Sep 24, 2013 at 12:19:51PM -0400, Robert Haas wrote:
On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund <andres@2ndquadrant.com>
wrote:
Actually, I was wondering if we ought
to have a directory under pgdata whose explicit charter it was to
contain files that shouldn't be copied as part of a base backup.
pg_do_not_back_this_up.
Wondered exactly about that as soon as you've mentioned
pg_basebackup. pg_local/?
That seems reasonable. It's not totally transparent what that's
supposed to mean, but it's fairly mnemonic once you know. Other
suggestions welcome, if anyone has ideas.
I like the concept and have no improvements on the name.
Are there any other likely candidates for inclusion in that directory
other than this stuff?
pg_xlog
Isn't it also pointless to backup temp objects as well as non-logged tables?
Or is the purpose of pg_local to be a home for things that MUST NOT be
backed up as opposed to items where backing them up is pointless?
I don't know. I found it surprising that Noah suggested including
pg_xlog; that's certainly not "node-local state" in any meaningful
sense, the way dsm stuff is. But I guess if pg_basebackup excludes it,
it arguably qualifies. However, it'd be nice to advertise that
pg_local can be cleared when the server is shut down, and you
certainly can NOT do that to pg_xlog. Whee!
I think the way I'd summarize the state of this patch is that everyone
who has looked at it more or less agrees that the big picture is
right, but there are details, which I think everyone admits are pretty
much corner cases, where different people have different ideas about
which way to go and what risks and benefits might be thereby incurred.
I'll not pretend that my opinions are categorically better than any
others, but of course I like them because they are mine. Figuring out
just what to do to about those last details seems challenging; no
answer will completely please everyone.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Sep 24, 2013 at 08:58:36AM +0530, Amit Kapila wrote:
On Tue, Sep 24, 2013 at 12:32 AM, Noah Misch <noah@leadboat.com> wrote:
I don't know whether writing it as binary will help or hurt that situation.
If nothing else, binary gives you one less variation to think about when
studying the code.
In that case, shouldn't all other places be consistent? One reason I
had in mind for using the appropriate mode is that somebody reading
the code may later come up with a question or a patch to use the
correct mode, and then we will be in the same situation again.
There are cases that must use binary I/O (table data files), cases that
benefit notably from text I/O (log files, postgresql.conf), and cases where it
doesn't matter too much (dsm state file, postmaster.pid). I don't see a need
to make widespread changes to other call sites.
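As a concrete illustration of the "one less variation to think about" point, a state-file writer can simply always ask for binary mode: "wb" behaves exactly like "w" on POSIX systems but disables newline translation on Windows, so the on-disk bytes are identical everywhere. This is only a sketch with an invented function and path; it mirrors the idea behind PostgreSQL's PG_BINARY_* fopen-mode macros rather than quoting the actual dsm state-file code.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write a one-line state file in binary mode.  "wb" means the '\n' we
 * write is stored as a single byte on every platform, instead of being
 * expanded to "\r\n" on Windows.  Returns 0 on success, -1 on failure.
 */
static int
write_state_file(const char *path, unsigned long control_handle)
{
	FILE	   *f = fopen(path, "wb");
	int			ok;

	if (f == NULL)
		return -1;
	ok = (fprintf(f, "%lu\n", control_handle) >= 0);
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A reader of the file then never has to wonder which platform wrote it.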
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
On Thu, Sep 26, 2013 at 9:27 AM, Noah Misch <noah@leadboat.com> wrote:
"There's no data corruption problem if we proceed" - but there likely
has been one leading to the current state.
+1 for making this one a PANIC, though. With startup behind us, a valid dsm
state file pointed us to a control segment with bogus contents. The
conditional probability of shared memory corruption seems higher than that of
a DBA editing the dsm state file of a running cluster to incorrectly name as
the dsm control segment some other existing shared memory segment.
To respond specifically to this point... inability to open a file on
disk does not mean that shared memory is corrupted. Full stop.
A scenario I have seen a few times is that someone changes the
permissions on part or all of $PGDATA while the server is running. I
have only ever seen this happen on Windows. What typically happens
today - depending on the exact scenario - is that the checkpoints will
fail, but the server will remain up, sometimes even committing
transactions under synchronous_commit=off, even though it can't write
out its data. If you fix the permissions before shutting down the
server, you don't even lose any data. Making inability to read a file
into a PANIC condition will cause any such cluster to remain up only
as long as nobody tries to use dynamic shared memory, and then throw
up its guts. I don't think users will appreciate that.
I am tempted to commit the latest version of this patch as I have it.
I think there's a lot of bikeshedding left to be done here, but
there's no real reason why we can't change this subsequent to the
initial commit as the answers become more clear. Changing the error
levels used for particular messages, or rearranging the directory
structure, is quite trivial. But we can't do that as long as we have
N people with >=N opinions on what to do, and the way to get more
clarity there is to get the code out in front of a few more people and
see how things shake out.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-10-08 15:40:04 -0400, Robert Haas wrote:
I am tempted to commit the latest version of this patch as I have it.
I haven't looked at the latest version of the patch, but based on the
previous version I have no problem with that.
If you'd feel more comfortable with another round of review, scanning
for things other than elevels, I can do that towards the weekend. Before
or after you've committed.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 08, 2013 at 03:40:04PM -0400, Robert Haas wrote:
On Thu, Sep 26, 2013 at 9:27 AM, Noah Misch <noah@leadboat.com> wrote:
"There's no data corruption problem if we proceed" - but there likely
has been one leading to the current state.
+1 for making this one a PANIC, though. With startup behind us, a valid dsm
state file pointed us to a control segment with bogus contents. The
conditional probability of shared memory corruption seems higher than that of
a DBA editing the dsm state file of a running cluster to incorrectly name as
the dsm control segment some other existing shared memory segment.
To respond specifically to this point... inability to open a file on
disk does not mean that shared memory is corrupted. Full stop.
A scenario I have seen a few times is that someone changes the
permissions on part or all of $PGDATA while the server is running.
I was discussing the third ereport() in dsm_backend_startup(), which does not
pertain to inability to open a file. The second ereport() would fire in the
damaged-permissions scenario, and I fully agree with that one using ERROR.
Incidentally, dsm_backend_startup() has a typo: s/"one/"none/
I am tempted to commit the latest version of this patch as I have it.
Works for me.
Thanks,
nm
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
On Wed, Oct 9, 2013 at 1:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Sep 26, 2013 at 9:27 AM, Noah Misch <noah@leadboat.com> wrote:
"There's no data corruption problem if we proceed" - but there likely
has been one leading to the current state.
+1 for making this one a PANIC, though. With startup behind us, a valid dsm
state file pointed us to a control segment with bogus contents. The
conditional probability of shared memory corruption seems higher than that of
a DBA editing the dsm state file of a running cluster to incorrectly name as
the dsm control segment some other existing shared memory segment.
To respond specifically to this point... inability to open a file on
disk does not mean that shared memory is corrupted. Full stop.
A scenario I have seen a few times is that someone changes the
permissions on part or all of $PGDATA while the server is running. I
have only ever seen this happen on Windows. What typically happens
today - depending on the exact scenario - is that the checkpoints will
fail, but the server will remain up, sometimes even committing
transactions under synchronous_commit=off, even though it can't write
out its data. If you fix the permissions before shutting down the
server, you don't even lose any data. Making inability to read a file
into a PANIC condition will cause any such cluster to remain up only
as long as nobody tries to use dynamic shared memory, and then throw
up its guts. I don't think users will appreciate that.
I am tempted to commit the latest version of this patch as I have it.
1. Do you think we should add information about pg_dynshmem file at link:
http://www.postgresql.org/docs/devel/static/storage-file-layout.html
It contains information about all files/folders in data directory
2.
+/*
+ * Forget that a temporary file is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetDSM(ResourceOwner owner, dsm_segment *seg)
+{
Above function description should use 'dynamic shmem segment' rather
than temporary file.
"Forget that a dynamic shmem segment is owned by a ResourceOwner"
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sun, Oct 13, 2013 at 3:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
1. Do you think we should add information about pg_dynshmem file at link:
http://www.postgresql.org/docs/devel/static/storage-file-layout.html
It contains information about all files/folders in data directory
2.
+/*
+ * Forget that a temporary file is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetDSM(ResourceOwner owner, dsm_segment *seg)
+{
Above function description should use 'dynamic shmem segment' rather
than temporary file.
"Forget that a dynamic shmem segment is owned by a ResourceOwner"
Good catches, will fix.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Oct 14, 2013 at 5:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Oct 13, 2013 at 3:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
1. Do you think we should add information about pg_dynshmem file at link:
http://www.postgresql.org/docs/devel/static/storage-file-layout.html
It contains information about all files/folders in data directory
2.
+/*
+ * Forget that a temporary file is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetDSM(ResourceOwner owner, dsm_segment *seg)
+{
Above function description should use 'dynamic shmem segment' rather
than temporary file.
"Forget that a dynamic shmem segment is owned by a ResourceOwner"
Good catches, will fix.
During test, I found one issue in Windows implementation.
During startup, when it tries to create new control segment for
dynamic shared memory, it loops until an unused identifier is found,
but for Windows implementation (dsm_impl_windows()), it was returning
error for EEXIST. This error will convert into FATAL as it is during
postmaster startup and will not allow server to start.
Please find attached patch to fix the problem.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
fix_error_handling_eexist_windows.patch
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 8e72731..9f1ea5b 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -694,10 +694,6 @@ dsm_impl_windows(dsm_op op, dsm_handle handle, uint64 request_size,
* modified.
*/
CloseHandle(hmap);
- ereport(elevel,
- (errcode_for_dynamic_shared_memory(),
- errmsg("could not open shared memory segment \"%s\": %m",
- name)));
return false;
}
}
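For context, this is the retry pattern the fix above unblocks: when creating a brand-new control segment, "name already taken" is an expected outcome that should trigger another attempt, not an error report. The following is a self-contained POSIX sketch with invented names, using shm_open() rather than the Windows CreateFileMapping path; the real logic lives in dsm_impl.c behind dsm_impl_op().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>		/* O_* flags */
#include <stdio.h>
#include <sys/mman.h>	/* shm_open, shm_unlink */
#include <unistd.h>

/*
 * Create a brand-new shared memory segment, trying successive names until
 * an unused one is found.  O_EXCL makes shm_open() fail with EEXIST when
 * the name is taken; that is expected here and simply means "retry with
 * the next name".  Any other errno is a genuine error for the caller.
 * Returns a file descriptor, or -1 on failure.
 */
static int
create_fresh_segment(char *name_out, size_t name_len)
{
	int			attempt;

	for (attempt = 0; attempt < 100; attempt++)
	{
		int			fd;

		snprintf(name_out, name_len, "/dsm_demo.%d.%d",
				 (int) getpid(), attempt);
		fd = shm_open(name_out, O_RDWR | O_CREAT | O_EXCL, 0600);
		if (fd >= 0)
			return fd;			/* found an unused identifier */
		if (errno != EEXIST)
			return -1;			/* real failure: report this one */
		/* EEXIST: someone else owns this name; try the next one */
	}
	return -1;					/* implausibly many collisions */
}
```

The bug fixed by the patch was precisely that the Windows implementation reported the EEXIST case through ereport() instead of returning quietly so that a loop like this could move on to the next candidate name.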
On Mon, Oct 14, 2013 at 11:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
During test, I found one issue in Windows implementation.
During startup, when it tries to create new control segment for
dynamic shared memory, it loops until an unused identifier is found,
but for Windows implementation (dsm_impl_windows()), it was returning
error for EEXIST. This error will convert into FATAL as it is during
postmaster startup and will not allow server to start.
Please find attached patch to fix the problem.
Committed, thanks.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company