dynamic shared memory

Started by Robert Haas over 12 years ago · 41 messages · pgsql-hackers
#1Robert Haas
robertmhaas@gmail.com

Please find attached a first version of a patch to allow additional
"dynamic" shared memory segments; that is, shared memory segments that
are created after server startup, live for a period of time, and are
then destroyed when no longer needed. The main purpose of this patch
is to facilitate parallel query: if we've got multiple backends
working on the same query, they're going to need a way to communicate.
Doing that through the main shared memory segment seems infeasible
because we could, for some applications, need to share very large
amounts of data. For example, for internal sort, we basically load
the data to be sorted into memory and then rearrange an array of
pointers to the items being sorted. For parallel internal sort, we
might want to do much the same thing, but with different backend
processes manipulating different parts of the array. I'm not exactly
sure how that's going to work out yet in detail, but it seems fair to
say that the amount of data we want to share between processes there
could be quite a bit larger than anything we'd feel comfortable
nailing down in the permanent shared memory segment. Other cases,
like parallel sequential scan, might require much smaller buffers,
since there might not be much point in letting the scan get too far
ahead if nothing's consuming the tuples it produces. With this
infrastructure, we can choose at run-time exactly how much memory to
allocate for a particular purpose and return it to the operating
system as soon as we're done with it.

Creating a shared memory segment is a somewhat operating-system
dependent task. I decided that it would be smart to support several
different implementations and to let the user choose which one they'd
like to use via a new GUC, dynamic_shared_memory_type. Since we
currently require System V shared memory to be supported on all
platforms other than Windows, I have included a System V
implementation (shmget, shmctl, shmat, shmdt). However, as we know,
on many systems, System V shared memory limits are often low out of
the box and raising them is an annoyance for users. Therefore, I've
included an implementation based on POSIX shared memory facilities
(shm_open, shm_unlink), which is the default on systems where those
facilities are supported (some of the BSDs do not support them, I believe). We
will also need a Windows implementation, which I have not attempted,
but one of my colleagues at EnterpriseDB will be filling in that gap.

In addition, I've included an implementation based on mmap of a plain
file. As compared with a true shared memory implementation, this
obviously has the disadvantage that the OS may be more likely to
decide to write back dirty pages to disk, which could hurt
performance. However, I believe it's worthy of inclusion all the
same, because there are a variety of situations in which it might be
more convenient than one of the other implementations. One is
debugging. On MacOS X, for example, there seems to be no way to list
POSIX shared memory segments, and no easy way to inspect the contents
of either POSIX or System V shared memory segments. Another use case
is working around an administrator-imposed or OS-imposed shared memory
limit. If you're not allowed to allocate shared memory, but you are
allowed to create files, then this implementation will let you use
whatever facilities we build on top of dynamic shared memory anyway.
A third possible reason to use this implementation is
compartmentalization. For example, you can put the directory that
stores the dynamic shared memory segments on a RAM disk - which
removes the performance concern - and then do whatever you like with
that directory: secure it, put filesystem quotas on it, or sprinkle
magic pixie dust on it. It doesn't even seem out of the question that
there might be cases where there are multiple RAM disks present with
different performance characteristics (e.g. on NUMA machines) and this
would provide fine-grained control over where your shared memory
segments get placed. To make a long story short, I won't be crushed
if the consensus is against including this, but I think it's useful.

Other implementations are imaginable but not implemented here. For
example, you can imagine using the mmap() of an anonymous file.
However, since the point is that these segments are created on the fly
by individual backends and then shared with other backends, that gets
a little tricky. In order for the second backend to map the same
anonymous shared memory segment that the first one mapped, you'd have
to pass the file descriptor from one process to the other. There are
ways, on most if not all platforms, to pass file descriptors through
sockets, but there's not automatically a socket connection between the
two processes either, so it gets hairy to think about making this
work. I did, however, include a "none" implementation which has the
effect of shutting the facility off altogether.

The actual implementation is split up into two layers. dsm_impl.c/h
encapsulate the implementation-dependent functionality at a very raw
level, while dsm.c/h wrap that functionality in a more palatable API.
Most of that wrapper layer is concerned with just one problem:
avoiding leaks. This turned out to require multiple levels of
safeguards, which I duly implemented. First, dynamic shared memory
segments need to be reference-counted, so that when the last mapping
is removed, the segment automatically goes away (we could allow for
server-lifespan segments as well with only trivial changes, but I'm
not sure whether there are compelling use cases for that). If a
backend is terminated uncleanly, the postmaster needs to remove all
leftover segments during the crash-and-restart process, just as it
needs to reinitialize the main shared memory segment. And if all
processes are terminated uncleanly, the next postmaster startup needs
to clean up any segments that still exist, again just as we already do
for the main shared memory segment. Neither POSIX shared memory nor
System V shared memory provide an API for enumerating all existing
shared memory segments, so we must keep track ourselves of what we
have outstanding. Second, we need to ensure, within the scope of an
individual process, that we only retain a mapping for as long as
necessary. Just as memory contexts, locks, buffer pins, and other
resources automatically go away at the end of a query or
(sub)transaction, dynamic shared memory mappings created for a purpose
such as parallel sort need to go away if we abort mid-way through. Of
course, if you have a user backend coordinating with workers, it seems
pretty likely that the workers are just going to exit if they hit an
error, so having the mapping be process-lifetime wouldn't necessarily
be a big deal; but the user backend may stick around for a long time
and execute other queries, and we can't afford to have it accumulate
mappings, not least because that's equivalent to a session-lifespan
memory leak.

To help solve these problems, I invented something called the "dynamic
shared memory control segment". This is a dynamic shared memory
segment created at startup (or reinitialization) time by the
postmaster before any user processes are created. It is used to store a
list of the identities of all the other dynamic shared memory segments
we have outstanding and the reference count of each. If the
postmaster goes through a crash-and-reset cycle, it scans the control
segment and removes all the other segments mentioned there, and then
recreates the control segment itself. If the postmaster is killed off
(e.g. kill -9) and restarted, it locates the old control segment and
proceeds similarly. If the whole operating system is rebooted, the
old control segment won't exist any more, but that's OK, because none
of the other segments will either - except under the
mmap-a-regular-file implementation, which handles cleanup by scanning
the relevant directory rather than relying on the control segment.
These precautions seem sufficient to ensure that dynamic shared memory
segments can't survive the postmaster itself short of a hard kill, and
that even after a hard kill we'll clean things up on a subsequent
postmaster startup. The other problem, of making sure that segments
get unmapped at the proper time, is solved using the resource owner
mechanism. There is an API to create a mapping which is
session-lifespan rather than resource-owner lifespan, but the default
is resource-owner lifespan, which I suspect will be right for common
uses. Thus, there are four separate occasions on which we remove
shared memory segments: (1) resource owner cleanup, (2) backend exit
(for any session-lifespan mappings and anything else that slips
through the cracks), (3) postmaster exit (in case a child dies without
cleaning itself up), and (4) postmaster startup (in case the
postmaster dies without cleaning up).

There are quite a few problems that this patch does not solve. First,
while it does give you a shared memory segment, it doesn't provide you
with any help at all in figuring out what to put in that segment. The
task of figuring out how to communicate usefully through shared memory
is thus, for the moment, left entirely to the application programmer.
While there may be cases where that's just right, I suspect there will
be a wider range of cases where it isn't, and I plan to work on some
additional facilities, sitting on top of this basic structure, next,
though probably as a separate patch. Second, it doesn't make any
policy decisions about what is sensible either in terms of number of
shared memory segments or the sizes of those segments, even though
there are serious practical limits in both cases. Actually, the total
number of segments system-wide is limited by the size of the control
segment, which is sized based on MaxBackends. But there's nothing to
keep a single backend from eating up all the slots, even though that's
both pretty unfriendly and unportable, and there's no real limit to
the amount of memory it can gobble up per slot, either. In other
words, it would be a bad idea to write a contrib module that exposes a
relatively uncooked version of this layer to the user.

But, just for testing purposes, I did just that. The attached patch
includes contrib/dsm_demo, which lets you say
dsm_demo_create('something') in one session, and if you pass the return
value to dsm_demo_read() in the same or another session during the
lifetime of the first session, you'll read back the same value you
saved. This is not, by any stretch of the imagination, a
demonstration of the right way to use this facility - but as a crude
unit test, it suffices. Although I'm including it in the patch file,
I would anticipate removing it before commit. Hopefully, with a
little more functionality on top of what's included here, we'll soon
be in a position to build something that might actually be useful to
someone, but this layer itself is a bit too impoverished to build
something really cool, at least not without more work than I wanted to
put in as part of the development of this patch.

Using that crappy contrib module, I verified that the POSIX, System V,
and mmap implementations all work on my MacBook Pro (OS X 10.8.4) and
on Linux (Fedora 16). I wouldn't like to have to wager on having
gotten all of the details right to be absolutely portable everywhere,
so I wouldn't be surprised to see this break on other systems.
Hopefully that will be a matter of adjusting the configure tests a bit
rather than coping with substantive changes in available
functionality, but we'll see.

Finally, I'd like to thank Noah Misch for a lot of discussion and
thought that enabled me to make this patch much better than it
otherwise would have been. Although I didn't adopt Noah's preferred
solutions to all of the problems, and although there are probably
still some problems buried here, there would have been more if not for
his advice. I'd also like to thank the entire database server team at
EnterpriseDB for allowing me to dump large piles of work on them so
that I could work on this, and my boss, Tom Kincaid, for not allowing
other people to dump large piles of work on me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

dynshmem-v1.patch (application/octet-stream), +2353 −29
#2Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#1)
Re: dynamic shared memory

Hi Robert,

[just sending an email which sat in my outbox for two weeks]

On 2013-08-13 21:09:06 -0400, Robert Haas wrote:

...

Nice to see this coming. I think it will actually be interesting for
quite some things outside parallel query, but we'll see.

I've not yet looked at the code, so I just have some highlevel comments
so far.

To help solve these problems, I invented something called the "dynamic
shared memory control segment". This is a dynamic shared memory
segment created at startup (or reinitialization) time by the
postmaster before any user processes are created. It is used to store a
list of the identities of all the other dynamic shared memory segments
we have outstanding and the reference count of each. If the
postmaster goes through a crash-and-reset cycle, it scans the control
segment and removes all the other segments mentioned there, and then
recreates the control segment itself. If the postmaster is killed off
(e.g. kill -9) and restarted, it locates the old control segment and
proceeds similarly.

That way any corruption in that area will prevent restarts without
reboot unless you use ipcrm, or such, right?

Creating a shared memory segment is a somewhat operating-system
dependent task. I decided that it would be smart to support several
different implementations and to let the user choose which one they'd
like to use via a new GUC, dynamic_shared_memory_type.

I think we want that during development, but I'd rather not go there
when releasing. After all, we don't support a manual choice between
anonymous mmap/sysv shmem either.

In addition, I've included an implementation based on mmap of a plain
file. As compared with a true shared memory implementation, this
obviously has the disadvantage that the OS may be more likely to
decide to write back dirty pages to disk, which could hurt
performance. However, I believe it's worthy of inclusion all the
same, because there are a variety of situations in which it might be
more convenient than one of the other implementations. One is
debugging.

Hm. Not sure what's the advantage over a corefile here.

On MacOS X, for example, there seems to be no way to list
POSIX shared memory segments, and no easy way to inspect the contents
of either POSIX or System V shared memory segments.

Shouldn't we ourselves know which segments are around?

Another use case
is working around an administrator-imposed or OS-imposed shared memory
limit. If you're not allowed to allocate shared memory, but you are
allowed to create files, then this implementation will let you use
whatever facilities we build on top of dynamic shared memory anyway.

I don't think we should try to work around limits like that.

A third possible reason to use this implementation is
compartmentalization. For example, you can put the directory that
stores the dynamic shared memory segments on a RAM disk - which
removes the performance concern - and then do whatever you like with
that directory: secure it, put filesystem quotas on it, or sprinkle
magic pixie dust on it. It doesn't even seem out of the question that
there might be cases where there are multiple RAM disks present with
different performance characteristics (e.g. on NUMA machines) and this
would provide fine-grained control over where your shared memory
segments get placed. To make a long story short, I won't be crushed
if the consensus is against including this, but I think it's useful.

-1 so far. Seems a bit handwavy to me.

Other implementations are imaginable but not implemented here. For
example, you can imagine using the mmap() of an anonymous file.
However, since the point is that these segments are created on the fly
by individual backends and then shared with other backends, that gets
a little tricky. In order for the second backend to map the same
anonymous shared memory segment that the first one mapped, you'd have
to pass the file descriptor from one process to the other.

It wouldn't even work. Several mappings of /dev/zero et al. do *not*
result in the same virtual memory being mapped. Not even when using the
same (passed around) fd.
Believe me, I tried ;)

There are quite a few problems that this patch does not solve. First,
while it does give you a shared memory segment, it doesn't provide you
with any help at all in figuring out what to put in that segment. The
task of figuring out how to communicate usefully through shared memory
is thus, for the moment, left entirely to the application programmer.
While there may be cases where that's just right, I suspect there will
be a wider range of cases where it isn't, and I plan to work on some
additional facilities, sitting on top of this basic structure, next,
though probably as a separate patch.

Agreed.

Second, it doesn't make any policy decisions about what is sensible either in terms of number of
shared memory segments or the sizes of those segments, even though
there are serious practical limits in both cases. Actually, the total
number of segments system-wide is limited by the size of the control
segment, which is sized based on MaxBackends. But there's nothing to
keep a single backend from eating up all the slots, even though that's
both pretty unfriendly and unportable, and there's no real limit to
the amount of memory it can gobble up per slot, either. In other
words, it would be a bad idea to write a contrib module that exposes a
relatively uncooked version of this layer to the user.

At this point I am rather unconcerned with this point to be
honest.

--- /dev/null
+++ b/src/include/storage/dsm.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsm.h
+ *	  manage dynamic shared memory segments
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/dsm.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DSM_H
+#define DSM_H
+
+#include "storage/dsm_impl.h"
+
+typedef struct dsm_segment dsm_segment;
+
+/* Initialization function. */
+extern void dsm_postmaster_startup(void);
+
+/* Functions that create, update, or remove mappings. */
+extern dsm_segment *dsm_create(uint64 size, char *preferred_address);
+extern dsm_segment *dsm_attach(dsm_handle h, char *preferred_address);
+extern void *dsm_resize(dsm_segment *seg, uint64 size,
+		   char *preferred_address);
+extern void *dsm_remap(dsm_segment *seg, char *preferred_address);
+extern void dsm_detach(dsm_segment *seg);

Why do we want to expose something unreliable as preferred_address to
the external interface? I haven't read the code yet, so I might be
missing something here.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#2)
Re: dynamic shared memory

On Tue, Aug 27, 2013 at 10:07 AM, Andres Freund <andres@2ndquadrant.com> wrote:

[just sending an email which sat in my outbox for two weeks]

Thanks for taking a look.

Nice to see this coming. I think it will actually be interesting for
quite some things outside parallel query, but we'll see.

Yeah, I hope so. The applications may be somewhat limited by the fact
that there are apparently fairly small limits to how many shared
memory segments you can map at the same time. I believe on one system
I looked at (some version of HP-UX?) the limit was 11. So we won't be
able to go nuts with this: using it definitely introduces all kinds of
failure modes that we don't have today. But it will also let us do
some pretty cool things that we CAN'T do today.

To help solve these problems, I invented something called the "dynamic
shared memory control segment". This is a dynamic shared memory
segment created at startup (or reinitialization) time by the
postmaster before any user processes are created. It is used to store a
list of the identities of all the other dynamic shared memory segments
we have outstanding and the reference count of each. If the
postmaster goes through a crash-and-reset cycle, it scans the control
segment and removes all the other segments mentioned there, and then
recreates the control segment itself. If the postmaster is killed off
(e.g. kill -9) and restarted, it locates the old control segment and
proceeds similarly.

That way any corruption in that area will prevent restarts without
reboot unless you use ipcrm, or such, right?

The way I've designed it, no. If what we expect to be the control
segment doesn't exist or doesn't conform to our expectations, we just
assume that it's not really the control segment after all - e.g.
someone rebooted, clearing all the segments, and then an unrelated
process (malicious, perhaps, or just a completely different cluster)
reused the same name. This is similar to what we do for the main
shared memory segment.

Creating a shared memory segment is a somewhat operating-system
dependent task. I decided that it would be smart to support several
different implementations and to let the user choose which one they'd
like to use via a new GUC, dynamic_shared_memory_type.

I think we want that during development, but I'd rather not go there
when releasing. After all, we don't support a manual choice between
anonymous mmap/sysv shmem either.

That's true, but that decision has not been uncontroversial - e.g. the
NetBSD guys don't like it, because they have a big performance
difference between those two types of memory. We have to balance the
possible harm of one more setting against the benefit of letting
people do what they want without needing to recompile or modify code.

In addition, I've included an implementation based on mmap of a plain
file. As compared with a true shared memory implementation, this
obviously has the disadvantage that the OS may be more likely to
decide to write back dirty pages to disk, which could hurt
performance. However, I believe it's worthy of inclusion all the
same, because there are a variety of situations in which it might be
more convenient than one of the other implementations. One is
debugging.

Hm. Not sure what's the advantage over a corefile here.

You can look at it while the server's running.

On MacOS X, for example, there seems to be no way to list
POSIX shared memory segments, and no easy way to inspect the contents
of either POSIX or System V shared memory segments.

Shouldn't we ourselves know which segments are around?

Sure, that's the point of the control segment. But listing a
directory is a lot easier than figuring out what the current control
segment contents are.

Another use case
is working around an administrator-imposed or OS-imposed shared memory
limit. If you're not allowed to allocate shared memory, but you are
allowed to create files, then this implementation will let you use
whatever facilities we build on top of dynamic shared memory anyway.

I don't think we should try to work around limits like that.

I do. There's probably someone, somewhere in the world who thinks
that operating system shared memory limits are a good idea, but I have
not met any such person. There are multiple ways to create shared
memory, and they all have different limits. Normally, System V limits
are small, POSIX limits are large, and the inherited-anonymous-mapping
trick we're now using for the main shared memory segment has no limits
at all. It's very common to run into a system where you can allocate
huge numbers of gigabytes of backend-private memory, but if you try to
allocate 64MB of *shared* memory, you get the axe - or maybe not,
depending on which API you use to create it.

I would never advocate deliberately trying to circumvent a
carefully-considered OS-level policy decision about resource
utilization, but I don't think that's the dynamic here. I think if we
insist on predetermining the dynamic shared memory implementation
based on the OS, we'll just be inconveniencing people needlessly, or
flat-out making things not work. I think this case is roughly similar
to wal_sync_method: there really shouldn't be a performance or
reliability difference between the ~6 ways of flushing a file to disk,
but as it turns out, there is, so we have an option. If we're SURE
that a Linux user will prefer "posix" to "sysv" or "mmap" or "none" in
100% of cases, and that a NetBSD user will always prefer "sysv" over
"mmap" or "none" in 100% of cases, then, OK, sure, let's bake it in.
But I'm not that sure.

It wouldn't even work. Several mappings of /dev/zero et al. do *not*
result in the same virtual memory being mapped. Not even when using the
same (passed around) fd.
Believe me, I tried ;)

OK, well that's another reason I didn't do it that way, then. :-)

At this point I am rather unconcerned with this point to be
honest.

I think that's appropriate; mostly, I wanted to emphasize that the
wisdom of allocating any given amount of shared memory is outside the
scope of this patch, which only aims to provide mechanism, not policy.

Why do we want to expose something unreliable as preferred_address to
the external interface? I haven't read the code yet, so I might be
missing something here.

I shared your opinion that preferred_address is never going to be
reliable, although FWIW Noah thinks it can be made reliable with a
large-enough hammer. But even if it isn't reliable, there doesn't
seem to be all that much value in forbidding access to that part of
the OS-provided API. In the world where it's not reliable, it may
still be convenient to map things at the same address when you can, so
that pointers can be used. Of course you'd have to have some
fallback strategy for when you don't get the same mapping, and maybe
that's painful enough that there's no point after all. Or maybe it's
worth having one code path for relativized pointers and another for
non-relativized pointers.

To be honest, I'm not real sure. I think it's clear enough that this
will meet the minimal requirements for parallel query - ONE dynamic
shared memory segment that's not guaranteed to be at the same address
in every backend, and can't be resized after creation. And we could
pare the API down to only support that. But I'd rather get some
experience with this first before we start taking away options.
Otherwise, we may never really find out the limits of what is possible
in this area, and I think that would be a shame.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#4Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Robert Haas (#1)
Re: dynamic shared memory

On 8/13/13 8:09 PM, Robert Haas wrote:

is removed, the segment automatically goes away (we could allow for
server-lifespan segments as well with only trivial changes, but I'm
not sure whether there are compelling use cases for that).

To clarify... you're talking about something that would intentionally survive postmaster restart? I don't see use for that either...

postmaster startup. The other problem, of making sure that segments
get unmapped at the proper time, is solved using the resource owner
mechanism. There is an API to create a mapping which is
session-lifespan rather than resource-owner lifespan, but the default
is resource-owner lifespan, which I suspect will be right for common
uses. Thus, there are four separate occasions on which we remove
shared memory segments: (1) resource owner cleanup, (2) backend exit
(for any session-lifespan mappings and anything else that slips
through the cracks), (3) postmaster exit (in case a child dies without
cleaning itself up), and (4) postmaster startup (in case the
postmaster dies without cleaning up).

Ignorant question... is ResourceOwner related to memory contexts? If not, would memory contexts be a better way to handle memory segment cleanup?

There are quite a few problems that this patch does not solve. First,

It also doesn't provide any mechanism for notifying backends of a new segment. Arguably that's beyond the scope of dsm.c, but ISTM that it'd be useful to have a standard method or three of doing that; perhaps just some convenience functions wrapping the methods mentioned in comments.

Finally, I'd like to thank Noah Misch for a lot of discussion and
thought on that enabled me to make this patch much better than it
otherwise would have been. Although I didn't adopt Noah's preferred
solutions to all of the problems, and although there are probably
still some problems buried here, there would have been more if not for
his advice. I'd also like to thank the entire database server team at
EnterpriseDB for allowing me to dump large piles of work on them so
that I could work on this, and my boss, Tom Kincaid, for not allowing
other people to dump large piles of work on me.

Thanks to you and the rest of the folks at EnterpriseDB... dynamic shared memory is something we've needed forever! :)

Other comments...

+ * If the state file is empty or the contents are garbled, it probably means
+ * that the operating system rebooted before the data written by the previous
+ * postmaster made it to disk.  In that case, we can just ignore it; any shared
+ * memory from before the reboot should be gone anyway.

I'm a bit concerned about this; I know it was possible in older versions for the global shared memory context to be left behind after a crash and needing to clean it up by hand. Dynamic shared mem potentially multiplies that by 100 or more. I think it'd be worth changing dsm_write_state_file so it always writes a new file and then does an atomic mv (or something similar).

+	 * If some other backend exited uncleanly, it might have corrupted the
+	 * control segment while it was dying.  In that case, we warn and ignore
+	 * the contents of the control segment.  This may end up leaving behind
+	 * stray shared memory segments, but there's not much we can do about
+	 * that if the metadata is gone.

Similar concern... in this case, would it be possible to always write updates to an un-used slot and then atomically update a pointer? This would be more work than what I suggested above, so maybe just a TODO for now...

Though... is there anything a dying backend could do that would corrupt the control segment to the point that it would screw up segments allocated by other backends and not related to the dead backend? Like marking a slot as not used when it is still in use and isn't associated with the dead backend? (I'm assuming that if a backend dies unexpectedly then all other backends using memory shared with that backend will need to handle themselves accordingly so that we don't need to worry about that in dsm.c.)

I was able to simplify dsm_create a bit (depending on your definition of simplify...). I'm not sure if the community is OK with using an ereport to exit a loop (though that could safely go outside the loop). In any case, I traded 5 lines of (mostly) duplicate code for an if{} and a break:

+	nitems = dsm_control->nitems;
+	for (i = 0; i <= nitems; ++i) /* Intentionally go one slot past what's currently been allocated */
+	{
+		if (dsm_control->item[i].refcnt == 0)
+		{
+			dsm_control->item[i].handle = seg->handle;
+			/* refcnt of 1 triggers destruction, so start at 2 */
+			dsm_control->item[i].refcnt = 2;
+			seg->control_slot = i;
+			if (i == nitems) /* We hit the end of the list */
+			{
+				/* Verify that we can support an additional mapping. */
+				if (nitems >= dsm_control->maxitems)
+					ereport(ERROR,
+							(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+							errmsg("too many dynamic shared memory segments")));
+
+				dsm_control->nitems++;
+			}
+			break;
+		}
+	}
+
+	LWLockRelease(DynamicSharedMemoryControlLock);
+	return seg;

Should this (in dsm_attach)

+ * If you're hitting this error, you probably want to use attempt to

be

+ * If you're hitting this error, you probably want to attempt to

?

Should dsm_impl_op sanity check the arguments after op? I didn't notice checks in the type-specific code but I also didn't read all of it... are we just depending on the OS to sanity-check?

Also, does the GUC code enforce that the GUC must always be something that's supported? If not then the error in dsm_impl_op should be more user-friendly.

I basically stopped reading after dsm_impl_op... the rest of the stuff was rather over my head.
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#3)
Re: dynamic shared memory

Hi,

On 2013-08-28 15:20:57 -0400, Robert Haas wrote:

That way any corruption in that area will prevent restarts without
reboot unless you use ipcrm, or such, right?

The way I've designed it, no. If what we expect to be the control
segment doesn't exist or doesn't conform to our expectations, we just
assume that it's not really the control segment after all - e.g.
someone rebooted, clearing all the segments, and then an unrelated
process (malicious, perhaps, or just a completely different cluster)
reused the same name. This is similar to what we do for the main
shared memory segment.

The case I am mostly wondering about is some process crashing and
overwriting random memory. We need to be pretty sure that we'll never
fail partially through cleaning up old segments because they are
corrupted or because we died halfway through our last cleanup attempt.

I think we want that during development, but I'd rather not go there
when releasing. After all, we don't support a manual choice between
anonymous mmap/sysv shmem either.

That's true, but that decision has not been uncontroversial - e.g. the
NetBSD guys don't like it, because they have a big performance
difference between those two types of memory. We have to balance the
possible harm of one more setting against the benefit of letting
people do what they want without needing to recompile or modify code.

But then, it made them fix the issue afaik :P

In addition, I've included an implementation based on mmap of a plain
file. As compared with a true shared memory implementation, this
obviously has the disadvantage that the OS may be more likely to
decide to write back dirty pages to disk, which could hurt
performance. However, I believe it's worthy of inclusion all the
same, because there are a variety of situations in which it might be
more convenient than one of the other implementations. One is
debugging.
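Concretely, the file-backed implementation being discussed amounts to something like the following (a sketch under POSIX assumptions; the function name is illustrative, not the patch's API):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Sketch of a file-backed "shared memory" segment: every process that
 * maps the same file with MAP_SHARED sees the same bytes.  Illustrative
 * only; the real patch hides this behind a common dsm_impl interface.
 */
static void *
map_shared_file(const char *path, size_t size)
{
	int			fd = open(path, O_CREAT | O_RDWR, 0600);
	void	   *addr;

	if (fd < 0)
		return NULL;
	if (ftruncate(fd, (off_t) size) != 0)	/* size the backing file */
	{
		close(fd);
		return NULL;
	}
	addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);					/* the mapping survives close() */
	return (addr == MAP_FAILED) ? NULL : addr;
}
```

The kernel is free to write dirty pages back to the file, which is exactly the performance concern raised above; the upside is that no shared-memory limits apply and the segment is visible as an ordinary file.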

Hm. Not sure what's the advantage over a corefile here.

You can look at it while the server's running.

That's what debuggers are for.

On MacOS X, for example, there seems to be no way to list
POSIX shared memory segments, and no easy way to inspect the contents
of either POSIX or System V shared memory segments.

Shouldn't we ourselves know which segments are around?

Sure, that's the point of the control segment. But listing a
directory is a lot easier than figuring out what the current control
segment contents are.

But without a good amount of tooling - like in a debugger... - it's not
very interesting to look at those files either way? The mere presence of
a segment doesn't tell you much and the contents won't be easily
readable.

Another use case is working around an administrator-imposed or
OS-imposed shared memory limit. If you're not allowed to allocate
shared memory, but you are allowed to create files, then this
implementation will let you use whatever facilities we build on top
of dynamic shared memory anyway.

I don't think we should try to work around limits like that.

I do. There's probably someone, somewhere in the world who thinks
that operating system shared memory limits are a good idea, but I have
not met any such person.

"Let's drive users away from sysv shmem" is the only one I heard so far ;)

I would never advocate deliberately trying to circumvent a
carefully-considered OS-level policy decision about resource
utilization, but I don't think that's the dynamic here. I think if we
insist on predetermining the dynamic shared memory implementation
based on the OS, we'll just be inconveniencing people needlessly, or
flat-out making things not work. [...]

But using file-backed memory will *suck* performancewise. Why should we
ever want to offer that to a user? That's what I was arguing about
primarily.

If we're SURE
that a Linux user will prefer "posix" to "sysv" or "mmap" or "none" in
100% of cases, and that a NetBSD user will always prefer "sysv" over
"mmap" or "none" in 100% of cases, then, OK, sure, let's bake it in.
But I'm not that sure.

I think posix shmem will be preferred to sysv shmem if present, in just
about any relevant case. I don't know of any system with lower limits on
posix shmem than on sysv.

I think this case is roughly similar
to wal_sync_method: there really shouldn't be a performance or
reliability difference between the ~6 ways of flushing a file to disk,
but as it turns out, there is, so we have an option.

Well, most of them actually give different guarantees, so it makes sense
to have differing performance...

Why do we want to expose something unreliable as preferred_address to
the external interface? I haven't read the code yet, so I might be
missing something here.

I shared your opinion that preferred_address is never going to be
reliable, although FWIW Noah thinks it can be made reliable with a
large-enough hammer.

I think we need to have the arguments for that on list then. Those are
pretty damn fundamental design decisions.
I for one cannot see how you even remotely could make that work a) on
windows (check the troubles we have to go through to get s_b
consistently placed, and that's directly after startup) b) 32bit systems.

But even if it isn't reliable, there doesn't seem to be all that much
value in forbidding access to that part of the OS-provided API. In
the world where it's not reliable, it may still be convenient to map
things at the same address when you can, so that pointers can be
used. Of course you'd have to have some fallback strategy for when
you don't get the same mapping, and maybe that's painful enough that
there's no point after all. Or maybe it's worth having one code path
for relativized pointers and another for non-relativized pointers.
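The "relativized pointers" alternative mentioned here boils down to storing offsets from the segment base instead of absolute addresses; a minimal sketch (helper names hypothetical, not from the patch):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Store offsets, not addresses, so the segment may be mapped at a
 * different base address in each process.  Offsets are biased by one
 * so that 0 can represent a NULL pointer.  Illustrative only.
 */
typedef size_t relptr;

static relptr
relptr_encode(void *base, void *ptr)
{
	return (ptr == NULL) ? 0 : (size_t) ((char *) ptr - (char *) base) + 1;
}

static void *
relptr_decode(void *base, relptr off)
{
	return (off == 0) ? NULL : (void *) ((char *) base + (off - 1));
}
```

The cost is an extra addition on every dereference, which is why mapping at a common address stays attractive whenever the OS cooperates.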

It seems likely to me that will end up with untested code in that
case. Or even unsupported platforms.

To be honest, I'm not real sure. I think it's clear enough that this
will meet the minimal requirements for parallel query - ONE dynamic
shared memory segment that's not guaranteed to be at the same address
in every backend, and can't be resized after creation. And we could
pare the API down to only support that. But I'd rather get some
experience with this first before we start taking away options.
Otherwise, we may never really find out the limits of what is possible
in this area, and I think that would be a shame.

On the other hand, adding capabilities annoys people far less than
deciding that we can't support them in the end and taking them away.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#5)
Re: dynamic shared memory

On Fri, Aug 30, 2013 at 9:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Why do we want to expose something unreliable as preferred_address to
the external interface? I haven't read the code yet, so I might be
missing something here.

I shared your opinion that preferred_address is never going to be
reliable, although FWIW Noah thinks it can be made reliable with a
large-enough hammer.

I think we need to have the arguments for that on list then. Those are
pretty damn fundamental design decisions.
I for one cannot see how you even remotely could make that work a) on
windows (check the troubles we have to go through to get s_b
consistently placed, and that's directly after startup) b) 32bit systems.

For Windows, I believe we are already doing something similar
(attaching at a predefined address) for the main shared memory. It
reserves memory at a particular address using
pgwin32_ReserveSharedMemoryRegion() before actually starting (resuming
a process created in suspended mode) a process, and then after startup
the backend attaches at the same address (PGSharedMemoryReAttach).

I think one question here is what the use of exposing
preferred_address is; I can think of only the following:
a. The base OS APIs provide such a provision, so why don't we?
b. While browsing, I found a few examples on IBM's site where they
also show usage with a preferred address.
http://publib.boulder.ibm.com/infocenter/comphelp/v7v91/index.jsp?topic=%2Fcom.ibm.vacpp7a.doc%2Fproguide%2Fref%2Fcreate_heap.htm
c. If a user wishes to attach segments at the same base address, so
that pointers into the memory-mapped file remain valid, which
otherwise would not be possible.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#7Robert Haas
robertmhaas@gmail.com
In reply to: Jim Nasby (#4)
Re: dynamic shared memory

On Thu, Aug 29, 2013 at 8:12 PM, Jim Nasby <jim@nasby.net> wrote:

On 8/13/13 8:09 PM, Robert Haas wrote:

is removed, the segment automatically goes away (we could allow for
server-lifespan segments as well with only trivial changes, but I'm
not sure whether there are compelling use cases for that).

To clarify... you're talking about something that would intentionally survive
postmaster restart? I don't see use for that either...

No, I meant something that would live as long as the postmaster and
die when it dies.

Ignorant question... is ResourceOwner related to memory contexts? If not,
would memory contexts be a better way to handle memory segment cleanup?

Nope. :-)

There are quite a few problems that this patch does not solve. First,

It also doesn't provide any mechanism for notifying backends of a new
segment. Arguably that's beyond the scope of dsm.c, but ISTM that it'd be
useful to have a standard method or three of doing that; perhaps just some
convenience functions wrapping the methods mentioned in comments.

I don't see that as being generally useful. Backends need to know
more than "there's a new segment", and in fact most backends won't
care about most new segments. A background worker needs to know about
the new segment *that it should attach*, but we have bgw_main_arg. If
we end up using this facility for any system-wide purposes, I imagine
we'll do that by storing the segment ID in the main shared memory
segment someplace.

Thanks to you and the rest of the folks at EnterpriseDB... dynamic shared
memory is something we've needed forever! :)

Thanks.

Other comments...

+ * If the state file is empty or the contents are garbled, it probably means
+ * that the operating system rebooted before the data written by the previous
+ * postmaster made it to disk.  In that case, we can just ignore it; any shared
+ * memory from before the reboot should be gone anyway.

I'm a bit concerned about this; I know it was possible in older versions for
the global shared memory context to be left behind after a crash, requiring
manual cleanup. Dynamic shared mem potentially multiplies that by 100 or more.
I think it'd be worth changing dsm_write_state_file so it always writes a new
file and then does an atomic mv (or something similar).

I agree that the possibilities for leftover shared memory segments are
multiplied with this new facility, and I've done my best to address
that. However, I don't agree that writing the state file in a
different way would improve anything.

+	 * If some other backend exited uncleanly, it might have corrupted the
+	 * control segment while it was dying.  In that case, we warn and ignore
+	 * the contents of the control segment.  This may end up leaving behind
+	 * stray shared memory segments, but there's not much we can do about
+	 * that if the metadata is gone.

Similar concern... in this case, would it be possible to always write
updates to an un-used slot and then atomically update a pointer? This would
be more work than what I suggested above, so maybe just a TODO for now...

Though... is there anything a dying backend could do that would corrupt the
control segment to the point that it would screw up segments allocated by
other backends and not related to the dead backend? Like marking a slot as
not used when it is still in use and isn't associated with the dead backend?

Sure. A messed-up backend can clobber the control segment just as it
can clobber anything else in shared memory. There's really no way
around that problem. If the control segment has been overwritten by a
memory stomp, we can't use it to clean up. There's no way around that
problem except to not have a control segment at all, which wouldn't be better.

Should this (in dsm_attach)

+ * If you're hitting this error, you probably want to use attempt to

be

+ * If you're hitting this error, you probably want to attempt to

?

Good point.

Should dsm_impl_op sanity check the arguments after op? I didn't notice
checks in the type-specific code but I also didn't read all of it... are we
just depending on the OS to sanity-check?

Sanity-check for what?

Also, does the GUC code enforce that the GUC must always be something that's
supported? If not then the error in dsm_impl_op should be more
user-friendly.

Yes.

I basically stopped reading after dsm_impl_op... the rest of the stuff was
rather over my head.

:-)

Thanks for your interest!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#8Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#5)
Re: dynamic shared memory

On Fri, Aug 30, 2013 at 11:45 AM, Andres Freund <andres@2ndquadrant.com> wrote:

The way I've designed it, no. If what we expect to be the control
segment doesn't exist or doesn't conform to our expectations, we just
assume that it's not really the control segment after all - e.g.
someone rebooted, clearing all the segments, and then an unrelated
process (malicious, perhaps, or just a completely different cluster)
reused the same name. This is similar to what we do for the main
shared memory segment.

The case I am mostly wondering about is some process crashing and
overwriting random memory. We need to be pretty sure that we'll never
fail partially through cleaning up old segments because they are
corrupted or because we died halfway through our last cleanup attempt.

Right. I had those considerations in mind and I believe I have nailed
the hatch shut pretty tight. The cleanup code is designed never to
die with an error. Of course it might, but it would have to be
something like an out of memory failure or similar that isn't really
what we're concerned about here. You are welcome to look for holes,
but these issues are where most of my brainpower went during
development.

That's true, but that decision has not been uncontroversial - e.g. the
NetBSD guys don't like it, because they have a big performance
difference between those two types of memory. We have to balance the
possible harm of one more setting against the benefit of letting
people do what they want without needing to recompile or modify code.

But then, it made them fix the issue afaik :P

Pah. :-)

You can look at it while the server's running.

That's what debuggers are for.

Tough crowd. I like it. YMMV.

I would never advocate deliberately trying to circumvent a
carefully-considered OS-level policy decision about resource
utilization, but I don't think that's the dynamic here. I think if we
insist on predetermining the dynamic shared memory implementation
based on the OS, we'll just be inconveniencing people needlessly, or
flat-out making things not work. [...]

But using file-backed memory will *suck* performancewise. Why should we
ever want to offer that to a user? That's what I was arguing about
primarily.

I see. There might be additional writeback traffic, but it might not
be that bad in common cases. After all the data's pretty hot.

If we're SURE
that a Linux user will prefer "posix" to "sysv" or "mmap" or "none" in
100% of cases, and that a NetBSD user will always prefer "sysv" over
"mmap" or "none" in 100% of cases, then, OK, sure, let's bake it in.
But I'm not that sure.

I think posix shmem will be preferred to sysv shmem if present, in just
about any relevant case. I don't know of any system with lower limits on
posix shmem than on sysv.

OK, how about this.... SysV doesn't allow extending segments, but
mmap does. The thing here is that you're saying "remove mmap and keep
sysv" but Noah suggested to me that we remove sysv and keep mmap.
This suggests to me that the picture is not so black and white as you
think it is.

I shared your opinion that preferred_address is never going to be
reliable, although FWIW Noah thinks it can be made reliable with a
large-enough hammer.

I think we need to have the arguments for that on list then. Those are
pretty damn fundamental design decisions.
I for one cannot see how you even remotely could make that work a) on
windows (check the troubles we have to go through to get s_b
consistently placed, and that's directly after startup) b) 32bit systems.

Noah?

But even if it isn't reliable, there doesn't seem to be all that much
value in forbidding access to that part of the OS-provided API. In
the world where it's not reliable, it may still be convenient to map
things at the same address when you can, so that pointers can be
used. Of course you'd have to have some fallback strategy for when
you don't get the same mapping, and maybe that's painful enough that
there's no point after all. Or maybe it's worth having one code path
for relativized pointers and another for non-relativized pointers.

It seems likely to me that will end up with untested code in that
case. Or even unsupported platforms.

Maybe. I think for the amount of code we're talking about here, it's
not worth getting excited about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#9Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#8)
Re: dynamic shared memory

On Sat, Aug 31, 2013 at 08:27:14AM -0400, Robert Haas wrote:

On Fri, Aug 30, 2013 at 11:45 AM, Andres Freund <andres@2ndquadrant.com> wrote:

I shared your opinion that preferred_address is never going to be
reliable, although FWIW Noah thinks it can be made reliable with a
large-enough hammer.

I think we need to have the arguments for that on list then. Those are
pretty damn fundamental design decisions.

I somewhat disfavor having a vague "preferred_address" parameter. mmap()'s
first argument is specified that way, but mmap()'s specification caters to an
open-ended range of implementations and clients. A PostgreSQL backend
interface can be more rigid. If we choose to support fixed-address callers,
let those receive either the requested address or an ereport(ERROR). If the
caller does not care, make no effort to provide a consistent address. (Better
still, under --enable-cassert, try to force the address to differ across
processes.)

[quotations reordered]

But even if it isn't reliable, there doesn't seem to be all that much
value in forbidding access to that part of the OS-provided API. In

That's also valid, though. Even if no core code exploits the flexibility,
3rd-party code might do so.

the world where it's not reliable, it may still be convenient to map
things at the same address when you can, so that pointers can be
used. Of course you'd have to have some fallback strategy for when
you don't get the same mapping, and maybe that's painful enough that
there's no point after all. Or maybe it's worth having one code path
for relativized pointers and another for non-relativized pointers.

It seems likely to me that will end up with untested code in that
case. Or even unsupported platforms.

I agree. It would take an exceptional use case to justify such parallel code
paths; I won't expect that to ever happen for core code.

I for one cannot see how you even remotely could make that work a) on
windows (check the troubles we have to go through to get s_b
consistently placed, and that's directly after startup) b) 32bit systems.

Noah?

The difficulty depends on whether processes other than the segment's creator
will attach anytime or only as they start. Attachment at startup is enough
for parallel query, but it's not enough for something like lock table
expansion. I'll focus on the attach-anytime case since it's more general.

On a system supporting MAP_FIXED, implement this by having the postmaster
reserve address space under a PROT_NONE mapping, then carving out from that
mapping for each fixed-address dynamic segment. The size of the reservation
would be controlled by a GUC; one might set it to several times anticipated
peak usage. (The overhead of doing that depends on the kernel.) Windows
permits the same technique with its own primitives.
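A minimal sketch of that reservation technique on a Linux-like system (anonymous private memory stands in for real shared segments just to keep the example self-contained; names are illustrative):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Reserve a large PROT_NONE range up front, then carve accessible
 * segments out of it with MAP_FIXED.  Because the reserved range
 * already belongs to this process, MAP_FIXED cannot clobber unrelated
 * mappings.  Illustrative only; a real implementation would map
 * MAP_SHARED segments (e.g. from shm_open) into the carved range.
 */
static void *
reserve_address_space(size_t size)
{
	void	   *base = mmap(NULL, size, PROT_NONE,
							MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return (base == MAP_FAILED) ? NULL : base;
}

static void *
carve_segment(void *reserved_base, size_t offset, size_t size)
{
	void	   *addr = mmap((char *) reserved_base + offset, size,
							PROT_READ | PROT_WRITE,
							MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	return (addr == MAP_FAILED) ? NULL : addr;
}
```

Children inherit the reservation across fork(), so every backend can carve the same offsets to the same virtual addresses; offset and size must be page-aligned.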

A system where mmap() accepts only a zero address in practice (HP-UX,
according to Gnulib, although HP docs suggest it has improved over time)
requires a different technique. For those systems, expand the regular shared
memory segment and carve from that to make "dynamic" segments. This amounts
to adding ShmemFree() to supplement ShmemAlloc(). If a core platform had to
use this implementation, its disadvantages would be sufficient to discard the
whole idea of reliable fixed addresses. But I find it acceptable if it's a
crutch for older kernels, rare hardware, etc.

I don't foresee fundamental differences on 32-bit. All the allocation
maximums scale down, but that's the usual story for 32-bit.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


#10Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#9)
Re: dynamic shared memory

Hi Noah,

On 2013-09-01 09:24:00 -0400, Noah Misch wrote:

But even if it isn't reliable, there doesn't seem to be all that much
value in forbidding access to that part of the OS-provided API. In

That's also valid, though. Even if no core code exploits the flexibility,
3rd-party code might do so.

It seems more likely that 3rd party code misunderstands the
limitations. But perhaps that's being too picky.

I for one cannot see how you even remotely could make that work a) on
windows (check the troubles we have to go through to get s_b
consistently placed, and that's directly after startup) b) 32bit systems.

Noah?

The difficulty depends on whether processes other than the segment's creator
will attach anytime or only as they start. Attachment at startup is enough
for parallel query, but it's not enough for something like lock table
expansion. I'll focus on the attach-anytime case since it's more general.

Even on startup it might get more complicated than one immediately
imagines on EXEC_BACKEND type platforms because their memory layout
doesn't need to be the same. The more shared memory you need, the harder
that will be.

On a system supporting MAP_FIXED, implement this by having the postmaster
reserve address space under a PROT_NONE mapping, then carving out from that
mapping for each fixed-address dynamic segment. The size of the reservation
would be controlled by a GUC; one might set it to several times anticipated
peak usage. (The overhead of doing that depends on the kernel.) Windows
permits the same technique with its own primitives.

Note that allocating a large mapping, even without using it, has
noticeable cost, at least under Linux. The kernel has to create and
copy data to track each page's state (without copying the memory
contents themselves, due to COW) for every fork afterwards. If you
don't believe me, check the whole discussion about Go's (the language)
memory management...

If that's the solution we go for why don't we just always include heaps
more shared memory and use that (remapping part of it as PROT_NONE)
instead of building the infrastructure in this patch?

A system where mmap() accepts only a zero address in practice (HP-UX,
according to Gnulib, although HP docs suggest it has improved over time)
requires a different technique. For those systems, expand the regular shared
memory segment and carve from that to make "dynamic" segments. This amounts
to adding ShmemFree() to supplement ShmemAlloc(). If a core platform had to
use this implementation, its disadvantages would be sufficient to discard the
whole idea of reliable fixed addresses. But I find it acceptable if it's a
crutch for older kernels, rare hardware, etc.
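Supplementing ShmemAlloc() with a ShmemFree(), as suggested above, means the main segment needs a real allocator rather than a pure bump pointer. A toy sketch of the idea, using a first-fit free list over one fixed arena (all names and details here are hypothetical, not from PostgreSQL; a real version would also coalesce adjacent free blocks):

```c
#include <stddef.h>

typedef struct free_block
{
	size_t size;				/* usable bytes following the header */
	struct free_block *next;
} free_block;

static char arena[64 * 1024];	/* stand-in for the shared memory region */
static free_block *free_list;

void
shmem_arena_init(void)
{
	free_list = (free_block *) arena;
	free_list->size = sizeof(arena) - sizeof(free_block);
	free_list->next = NULL;
}

void *
shmem_alloc(size_t size)
{
	free_block **prev = &free_list;

	/* round up so blocks stay aligned and can be threaded back on the list */
	size = (size + 15) & ~(size_t) 15;
	for (free_block *b = free_list; b != NULL; prev = &b->next, b = b->next)
	{
		if (b->size < size)
			continue;
		if (b->size >= size + sizeof(free_block) + 16)
		{
			/* split: carve the tail off this block as a new free block */
			free_block *rest = (free_block *) ((char *) (b + 1) + size);

			rest->size = b->size - size - sizeof(free_block);
			rest->next = b->next;
			b->size = size;
			*prev = rest;
		}
		else
			*prev = b->next;	/* use the whole block */
		return b + 1;
	}
	return NULL;				/* out of shared memory */
}

void
shmem_free(void *p)
{
	/* push back on the free list; a real version would coalesce neighbors */
	free_block *b = ((free_block *) p) - 1;

	b->next = free_list;
	free_list = b;
}
```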

I don't foresee fundamental differences on 32-bit. All the allocation
maximums scale down, but that's the usual story for 32-bit.

If you actually want to allocate memory after starting up, without
carving out a section for it from the beginning, memory fragmentation
will make it very hard to find the same addresses across processes.
If you allocate everything from the start, what you end up with will not
actually be "dynamic shared memory", because you cannot reserve much
(even if not actually backed by memory!) without compromising
performance. So we will end up with a configurable
"dynamic_shared_memory = ..." parameter. And even then, all processes,
even those not actually using the "dynamic" memory, will have to pay the
price of not being able to allocate as much memory as otherwise possible.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#10)
Re: dynamic shared memory

On Sun, Sep 01, 2013 at 05:08:38PM +0200, Andres Freund wrote:

On 2013-09-01 09:24:00 -0400, Noah Misch wrote:

The difficulty depends on whether processes other than the segment's creator
will attach anytime or only as they start. Attachment at startup is enough
for parallel query, but it's not enough for something like lock table
expansion. I'll focus on the attach-anytime case since it's more general.

Even on startup it might get more complicated than one immediately
imagines on EXEC_BACKEND type platforms, because their memory layout
need not be the same across processes. The more shared memory you need,
the harder that gets, as far as I recall.

Non-Windows EXEC_BACKEND is already facing a dead end that way.

On a system supporting MAP_FIXED, implement this by having the postmaster
reserve address space under a PROT_NONE mapping, then carving out from that
mapping for each fixed-address dynamic segment. The size of the reservation
would be controlled by a GUC; one might set it to several times anticipated
peak usage. (The overhead of doing that depends on the kernel.) Windows
permits the same technique with its own primitives.

Note that allocating a large mapping, even without using it, has
noticeable cost, at least under Linux. For every subsequent fork, the
kernel has to create and copy the data tracking each page's state
(without copying the memory contents themselves, due to COW). If you
don't believe me, check the whole discussion about Go's (the language)
memory management...

I believe you, but I'd appreciate a link to the discussion you have in mind.

If that's the solution we go for, why don't we just always include heaps
more shared memory and use that (remapping part of it as PROT_NONE)
instead of building the infrastructure in this patch?

There would be no freeing of the memory; a usage high water mark would stand
for the life of the postmaster.

I don't foresee fundamental differences on 32-bit. All the allocation
maximums scale down, but that's the usual story for 32-bit.

If you actually want to allocate memory after starting up, without
carving out a section for it from the beginning, memory fragmentation
will make it very hard to find the same addresses across processes.

True. I wouldn't feel bad if total dynamic shared memory usage above, say,
256 MiB were unreliable on 32-bit. If you're still running 32-bit in 2015,
you probably have a low-memory platform.

I think the take-away is that we have a lot of knobs available, not a bright
line between possible and impossible. Robert opted to omit provision for
reliable fixed addresses, and the upsides of that decision are the absence of
a DBA-unfriendly space-reservation GUC, trivial overhead when the APIs are not
used, and a clearer portability outlook.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


#12Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#11)
Re: dynamic shared memory

Hi Noah!

On 2013-09-01 12:07:04 -0400, Noah Misch wrote:

On Sun, Sep 01, 2013 at 05:08:38PM +0200, Andres Freund wrote:

On 2013-09-01 09:24:00 -0400, Noah Misch wrote:

The difficulty depends on whether processes other than the segment's creator
will attach anytime or only as they start. Attachment at startup is enough
for parallel query, but it's not enough for something like lock table
expansion. I'll focus on the attach-anytime case since it's more general.

Even on startup it might get more complicated than one immediately
imagines on EXEC_BACKEND type platforms, because their memory layout
need not be the same across processes. The more shared memory you need,
the harder that gets, as far as I recall.

Non-Windows EXEC_BACKEND is already facing a dead end that way.

Not sure whether you mean non-windows EXEC_BACKEND isn't going to be
supported for much longer or that it already has problems.

On a system supporting MAP_FIXED, implement this by having the postmaster
reserve address space under a PROT_NONE mapping, then carving out from that
mapping for each fixed-address dynamic segment. The size of the reservation
would be controlled by a GUC; one might set it to several times anticipated
peak usage. (The overhead of doing that depends on the kernel.) Windows
permits the same technique with its own primitives.

Note that allocating a large mapping, even without using it, has
noticeable cost, at least under Linux. For every subsequent fork, the
kernel has to create and copy the data tracking each page's state
(without copying the memory contents themselves, due to COW). If you
don't believe me, check the whole discussion about Go's (the language)
memory management...

I believe you, but I'd appreciate a link to the discussion you have in mind.

Unfortunately I could only find the first half of the discussion about
the issue. Turns out it's not the greatest idea to name your fancy new
programming language "go" (yes, yes, pet peeve of mine).

http://lkml.org/lkml/2011/2/8/118
https://lwn.net/Articles/428100/

So, after reading up on the issue a bit more and reading some more
kernel code: a large mmap(PROT_NONE, MAP_PRIVATE) won't cause many
problems beyond counting against ulimit -v. It will *not* cause overcommit
violations. mmap(PROT_NONE, MAP_SHARED) will, though, even if not yet
faulted in. Which means that to be reliable and not violate overcommit,
we'd need to munmap() a chunk of the PROT_NONE, MAP_PRIVATE memory and
immediately (without interceding mallocs, using mmap itself) map it again.

It only gets really expensive, in the sense of making fork expensive, if
you set protections on many regions in that mapping individually. Each
mprotect() call will split the VMA into distinct pieces, and they won't
get merged even if neighbors have the same settings.
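A rough illustration of that remapping dance (names invented, Linux-oriented). Note that MAP_FIXED makes mmap() drop the old mapping and install the new one in a single call, which is one way to avoid any interceding allocation grabbing the hole; fd == -1 uses anonymous shared memory so the sketch stands alone:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Replace a PROT_NONE window of the reservation with a shared mapping.
 * Passing MAP_FIXED atomically replaces whatever mapping is there, so no
 * other allocation can sneak into the hole between unmap and map.
 */
void *
claim_window(char *window, size_t size, int fd)
{
	void *p = mmap(window, size, PROT_READ | PROT_WRITE,
				   MAP_SHARED | MAP_FIXED | (fd == -1 ? MAP_ANONYMOUS : 0),
				   fd, 0);

	return (p == MAP_FAILED) ? NULL : p;
}

/*
 * Return a window to the PROT_NONE, MAP_PRIVATE state, so it stops
 * counting against overcommit until it is claimed again.
 */
int
release_window(char *window, size_t size)
{
	void *p = mmap(window, size, PROT_NONE,
				   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	return (p == MAP_FAILED) ? -1 : 0;
}
```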

I don't foresee fundamental differences on 32-bit. All the allocation
maximums scale down, but that's the usual story for 32-bit.

If you actually want to allocate memory after starting up, without
carving out a section for it from the beginning, memory fragmentation
will make it very hard to find the same addresses across processes.

True. I wouldn't feel bad if total dynamic shared memory usage above, say,
256 MiB were unreliable on 32-bit. If you're still running 32-bit in 2015,
you probably have a low-memory platform.

Not sure. I think that will partially depend on whether x32 will have
any success which I still find hard to judge.

I think the take-away is that we have a lot of knobs available, not a bright
line between possible and impossible. Robert opted to omit provision for
reliable fixed addresses, and the upsides of that decision are the absence of
a DBA-unfriendly space-reservation GUC, trivial overhead when the APIs are not
used, and a clearer portability outlook.

I guess my point is that if we want to develop stuff that requires
reliable addresses, we should build support for that from a low level
up, not rely on a hack^Wlayer on top of the actual dynamic shared memory
API.
That is, it should be a flag to dsm_create() that we require a fixed
address, and dsm_attach() will then automatically use that or die
trying. Requiring implementations to take care of passing addresses
around and fiddling with mmap/the Windows API to make sure those mappings
are possible doesn't strike me as a good idea.

In the end, you're going to be the primary/first user as far as I
understand things, so you'll have to argue whether we need fixed
addresses or not. I don't think it's a good idea to forgo this decision
at this layer and bolt another one on top if we decide it's necessary.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#13Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#12)
Re: dynamic shared memory

On Mon, Sep 2, 2013 at 6:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Not sure whether you mean non-windows EXEC_BACKEND isn't going to be
supported for much longer or that it already has problems.

I'm not sure what Noah was getting at, but I have used EXEC_BACKEND
twice now during development, in situations where I would otherwise have
needed a Windows development environment. So it's definitely useful, at
least to me. But on my MacBook Pro, you have to compile it with -fno-pie
(I think that's the right flag) to disable ASLR in order to get reliable
operation. I imagine such problems will become commonplace on more
and more platforms as time wears on.

I guess my point is that if we want to develop stuff that requires
reliable addresses, we should build support for that from a low level
up, not rely on a hack^Wlayer on top of the actual dynamic shared memory
API.
That is, it should be a flag to dsm_create() that we require a fixed
address, and dsm_attach() will then automatically use that or die
trying. Requiring implementations to take care of passing addresses
around and fiddling with mmap/the Windows API to make sure those mappings
are possible doesn't strike me as a good idea.

In the end, you're going to be the primary/first user as far as I
understand things, so you'll have to argue whether we need fixed
addresses or not. I don't think it's a good idea to forgo this decision
at this layer and bolt another one on top if we decide it's necessary.

I didn't intend to punt that decision to another layer so much as
another patch and a more detailed examination of requirements. IME,
given a choice between something that is 99% reliable and provides
more functionality, or something that is 99.99% reliable and provides
less functionality, this community picks the latter every time. And
that's why I've left out any capability to insist on a fixed address
from this patch. It would be nice to have, to be sure. But it also
would take more work and add more complexity, and I don't have a clear
sense that that work would be justified.

Now, we might get to a point where it seems clear that we're not going
to get any further with parallelism without adding a capability for
fixed-address mappings. If that happens, I think that's the time to
come back to this layer and add that capability. But right now it
doesn't seem essential. Now, having said that, I didn't see any
particular reason to bury the ability to pass mmap() or shmat() a
*preferred* address. But IJWH.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#14Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#12)
Re: dynamic shared memory

On Tue, Sep 03, 2013 at 12:52:22AM +0200, Andres Freund wrote:

On 2013-09-01 12:07:04 -0400, Noah Misch wrote:

On Sun, Sep 01, 2013 at 05:08:38PM +0200, Andres Freund wrote:

On 2013-09-01 09:24:00 -0400, Noah Misch wrote:

The difficulty depends on whether processes other than the segment's creator
will attach anytime or only as they start. Attachment at startup is enough
for parallel query, but it's not enough for something like lock table
expansion. I'll focus on the attach-anytime case since it's more general.

Even on startup it might get more complicated than one immediately
imagines on EXEC_BACKEND type platforms, because their memory layout
need not be the same across processes. The more shared memory you need,
the harder that gets, as far as I recall.

Non-Windows EXEC_BACKEND is already facing a dead end that way.

Not sure whether you mean non-windows EXEC_BACKEND isn't going to be
supported for much longer or that it already has problems.

It already has problems: ASLR measures sometimes prevent reattachment of the
main shared memory segment. Multiplying the combined size of our
fixed-address mappings does not push us over some threshold where this becomes
a problem, because it is already a problem.

Note that allocating a large mapping, even without using it, has
noticeable cost, at least under Linux. For every subsequent fork, the
kernel has to create and copy the data tracking each page's state
(without copying the memory contents themselves, due to COW).

So, after reading up on the issue a bit more and reading some more
kernel code: a large mmap(PROT_NONE, MAP_PRIVATE) won't cause many
problems beyond counting against ulimit -v. It will *not* cause overcommit
violations. mmap(PROT_NONE, MAP_SHARED) will, though, even if not yet
faulted in. Which means that to be reliable and not violate overcommit,
we'd need to munmap() a chunk of the PROT_NONE, MAP_PRIVATE memory and
immediately (without interceding mallocs, using mmap itself) map it again.

It only gets really expensive, in the sense of making fork expensive, if
you set protections on many regions in that mapping individually. Each
mprotect() call will split the VMA into distinct pieces, and they won't
get merged even if neighbors have the same settings.

Thanks for researching that.

I don't foresee fundamental differences on 32-bit. All the allocation
maximums scale down, but that's the usual story for 32-bit.

If you actually want to allocate memory after starting up, without
carving a section out for that from the beginning, the memory
fragmentation will make it very hard to find memory addresses of the
same across processes.

True. I wouldn't feel bad if total dynamic shared memory usage above, say,
256 MiB were unreliable on 32-bit. If you're still running 32-bit in 2015,
you probably have a low-memory platform.

Not sure. I think that will partially depend on whether x32 will have
any success which I still find hard to judge.

I won't hold my breath for x32 becoming a common platform for high-memory
database servers, regardless of other successes it might find. Not
impossible, but I recommend placing trivial priority on maximizing performance
for that scenario.

I think the take-away is that we have a lot of knobs available, not a bright
line between possible and impossible. Robert opted to omit provision for
reliable fixed addresses, and the upsides of that decision are the absence of
a DBA-unfriendly space-reservation GUC, trivial overhead when the APIs are not
used, and a clearer portability outlook.

I guess my point is that if we want to develop stuff that requires
reliable addresses, we should build support for that from a low level
up, not rely on a hack^Wlayer on top of the actual dynamic shared memory
API.
That is, it should be a flag to dsm_create() that we require a fixed
address, and dsm_attach() will then automatically use that or die
trying. Requiring implementations to take care of passing addresses
around and fiddling with mmap/the Windows API to make sure those mappings
are possible doesn't strike me as a good idea.

I agree.

In the end, you're going to be the primary/first user as far as I
understand things, so you'll have to argue whether we need fixed
addresses or not. I don't think it's a good idea to forgo this decision
at this layer and bolt another one on top if we decide it's necessary.

We don't need fixed addresses. Parallel internal sort will probably include
the equivalent of a SortTuple array in its shared memory segment, and that
implies relative pointers to the tuples also stored in shared memory. I
expect that wart to be fairly isolated within the code, so little harm done.

I don't think we will have at all painted ourselves into a corner, should we
wish to lift the limitation later.
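The relative-pointer scheme mentioned above can be sketched in a few lines: store offsets from the segment's base address instead of raw pointers, so every backend can decode them no matter where the segment happens to be mapped. The names below are illustrative, not from any patch (PostgreSQL did later grow a relptr.h along these lines):

```c
#include <stddef.h>
#include <stdint.h>

/* Offset from the segment base; 0 plays the role of NULL. */
typedef uint64_t relptr;

/* Convert a backend-local pointer into a mapping-independent offset. */
static inline relptr
relptr_store(char *base, void *p)
{
	return (p == NULL) ? 0 : (relptr) ((char *) p - base);
}

/* Convert an offset back into a pointer valid in this backend's mapping. */
static inline void *
relptr_access(char *base, relptr r)
{
	return (r == 0) ? NULL : (void *) (base + r);
}
```

The point is that a `relptr` stored inside the segment is meaningful to every process attached to it, even when each process mapped the segment at a different address; only the local `base` differs.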

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


#15Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Robert Haas (#7)
Re: dynamic shared memory

On 8/31/13 7:17 AM, Robert Haas wrote:

On Thu, Aug 29, 2013 at 8:12 PM, Jim Nasby <jim@nasby.net> wrote:

On 8/13/13 8:09 PM, Robert Haas wrote:

is removed, the segment automatically goes away (we could allow for
server-lifespan segments as well with only trivial changes, but I'm
not sure whether there are compelling use cases for that).

To clarify... you're talking about something that would intentionally survive
postmaster restart? I don't see a use for that either...

No, I meant something that would live as long as the postmaster and
die when it dies.

ISTM that at some point we'll want to look at putting top-level shared memory into this system (ie: allowing dynamic resizing of GUCs that affect shared memory size).

But as you said, it'd be trivial to add that later.

Other comments...

+ * If the state file is empty or the contents are garbled, it probably means
+ * that the operating system rebooted before the data written by the previous
+ * postmaster made it to disk.  In that case, we can just ignore it; any shared
+ * memory from before the reboot should be gone anyway.

I'm a bit concerned about this; I know it was possible in older versions for
the global shared memory context to be left behind after a crash, needing
to be cleaned up by hand. Dynamic shared mem potentially multiplies that by
100 or more. I think it'd be worth changing dsm_write_state_file so it
always writes a new file and then does an atomic mv (or something similar).

I agree that the possibilities for leftover shared memory segments are
multiplied with this new facility, and I've done my best to address
that. However, I don't agree that writing the state file in a
different way would improve anything.

Wouldn't it protect against a crash while writing the file? I realize the odds of that are pretty remote, but AFAIK it wouldn't cost that much to write a new file and do an atomic mv...

+        * If some other backend exited uncleanly, it might have corrupted the
+        * control segment while it was dying.  In that case, we warn and ignore
+        * the contents of the control segment.  This may end up leaving behind
+        * stray shared memory segments, but there's not much we can do about
+        * that if the metadata is gone.

Similar concern... in this case, would it be possible to always write
updates to an un-used slot and then atomically update a pointer? This would
be more work than what I suggested above, so maybe just a TODO for now...

Though... is there anything a dying backend could do that would corrupt the
control segment to the point that it would screw up segments allocated by
other backends and not related to the dead backend? Like marking a slot as
not used when it is still in use and isn't associated with the dead backend?

Sure. A messed-up backend can clobber the control segment just as it
can clobber anything else in shared memory. There's really no way
around that problem. If the control segment has been overwritten by a
memory stomp, we can't use it to clean up. There's no way around that
problem except to not use the control segment, which wouldn't be better.

Are we trying to protect against "memory stomps" when we restart after a backend dies? I thought we were just trying to ensure that all shared data structures were correct and consistent. If that's the case, then I was thinking that by using a pointer that can be updated in a CPU-atomic fashion we know we'd never end up with a corrupted entry that was in use; the partial write would be to a slot with nothing pointing at it so it could be safely reused.

Like I said before though, it may not be worth worrying about this case right now.

Should dsm_impl_op sanity check the arguments after op? I didn't notice
checks in the type-specific code but I also didn't read all of it... are we
just depending on the OS to sanity-check?

Sanity-check for what?

Presumably there are limits to what the arguments can rationally be set to. IIRC there's nothing downstream that's checking them in our code, so I'm guessing we're just depending on the kernel to sanity-check.
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net


#16Robert Haas
robertmhaas@gmail.com
In reply to: Jim Nasby (#15)
Re: dynamic shared memory

On Wed, Sep 4, 2013 at 6:38 PM, Jim Nasby <jim@nasby.net> wrote:

No, I meant something that would live as long as the postmaster and
die when it dies.

ISTM that at some point we'll want to look at putting top-level shared
memory into this system (ie: allowing dynamic resizing of GUCs that affect
shared memory size).

A lot of people want that, but being able to resize the shared memory
chunk itself is only the beginning of the problem. So I wouldn't hold
my breath.

Wouldn't it protect against a crash while writing the file? I realize the
odds of that are pretty remote, but AFAIK it wouldn't cost that much to
write a new file and do an atomic mv...

If there's an OS-level crash, we don't need the state file; the shared
memory will be gone anyway. And if it's a PostgreSQL-level failure,
this game neither helps nor hurts.

Sure. A messed-up backend can clobber the control segment just as it
can clobber anything else in shared memory. There's really no way
around that problem. If the control segment has been overwritten by a
memory stomp, we can't use it to clean up. There's no way around that
problem except to not use the control segment, which wouldn't be better.

Are we trying to protect against "memory stomps" when we restart after a
backend dies? I thought we were just trying to ensure that all shared data
structures were correct and consistent. If that's the case, then I was
thinking that by using a pointer that can be updated in a CPU-atomic fashion
we know we'd never end up with a corrupted entry that was in use; the
partial write would be to a slot with nothing pointing at it so it could be
safely reused.

When we restart after a backend dies, shared memory contents are
completely reset, from scratch. This is true of both the fixed size
shared memory segment and of the dynamic shared memory control
segment. The only difference is that, with the dynamic shared memory
control segment, we need to use the segment for cleanup before
throwing it out and starting over. Extra caution is required because
we're examining memory that could hypothetically have been stomped on;
we must not let the postmaster do anything suicidal.
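The patch expresses that caution through a sanity check on the control segment (its function list includes a dsm_control_segment_sane()). A simplified sketch of the kind of validation involved, with an illustrative magic value and struct details modeled on the header excerpt quoted elsewhere in the thread:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DSM_CONTROL_MAGIC 0x517261F1	/* illustrative value */

typedef struct dsm_control_item
{
	uint32_t	handle;
	uint32_t	refcnt;
} dsm_control_item;

typedef struct dsm_control_header
{
	uint32_t	magic;
	uint32_t	nitems;
	uint32_t	maxitems;
	dsm_control_item item[];
} dsm_control_header;

/*
 * Refuse to use a possibly-stomped control segment for cleanup unless it
 * passes basic consistency checks; better to leak stray segments than to
 * let the postmaster chase corrupt metadata.
 */
static bool
dsm_control_segment_sane(dsm_control_header *control, uint64_t mapped_size)
{
	if (mapped_size < sizeof(dsm_control_header))
		return false;			/* can't even read the header safely */
	if (control->magic != DSM_CONTROL_MAGIC)
		return false;			/* not a control segment after all */
	if (control->nitems > control->maxitems)
		return false;			/* item count is corrupt */
	/* every slot, used or not, must fit within the mapping */
	if (sizeof(dsm_control_header) +
		(uint64_t) control->maxitems * sizeof(dsm_control_item) > mapped_size)
		return false;
	return true;
}
```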

Should dsm_impl_op sanity check the arguments after op? I didn't notice
checks in the type-specific code but I also didn't read all of it... are
we
just depending on the OS to sanity-check?

Sanity-check for what?

Presumably there are limits to what the arguments can rationally be set to.
IIRC there's nothing downstream that's checking them in our code, so I'm
guessing we're just depending on the kernel to sanity-check.

Pretty much. It's possible more thought is needed there, but the
shape of those additional thoughts is not clear to me at this time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#17Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Robert Haas (#16)
Re: dynamic shared memory

On 9/5/13 11:37 AM, Robert Haas wrote:

ISTM that at some point we'll want to look at putting top-level shared

memory into this system (ie: allowing dynamic resizing of GUCs that affect
shared memory size).

A lot of people want that, but being able to resize the shared memory
chunk itself is only the beginning of the problem. So I wouldn't hold
my breath.

<starts breathing again>

Wouldn't it protect against a crash while writing the file? I realize the
odds of that are pretty remote, but AFAIK it wouldn't cost that much to
write a new file and do an atomic mv...

If there's an OS-level crash, we don't need the state file; the shared
memory will be gone anyway. And if it's a PostgreSQL-level failure,
this game neither helps nor hurts.

Sure. A messed-up backend can clobber the control segment just as it
can clobber anything else in shared memory. There's really no way
around that problem. If the control segment has been overwritten by a
memory stomp, we can't use it to clean up. There's no way around that
problem except to not use the control segment, which wouldn't be better.

Are we trying to protect against "memory stomps" when we restart after a
backend dies? I thought we were just trying to ensure that all shared data
structures were correct and consistent. If that's the case, then I was
thinking that by using a pointer that can be updated in a CPU-atomic fashion
we know we'd never end up with a corrupted entry that was in use; the
partial write would be to a slot with nothing pointing at it so it could be
safely reused.

When we restart after a backend dies, shared memory contents are
completely reset, from scratch. This is true of both the fixed size
shared memory segment and of the dynamic shared memory control
segment. The only difference is that, with the dynamic shared memory
control segment, we need to use the segment for cleanup before
throwing it out and starting over. Extra caution is required because
we're examining memory that could hypothetically have been stomped on;
we must not let the postmaster do anything suicidal.

Not doing something suicidal is what I'm worried about (that and not cleaning up as well as possible).

The specific scenario I'm worried about is something like a PANIC in the middle of the snprintf call in dsm_write_state_file(). That would leave that file in a completely unknown state so who knows what would then happen on restart. ISTM that writing a temp file and then doing a filesystem mv would eliminate that issue.

Or is it safe to assume that the snprintf call will be atomic since we're just spitting out a long?
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net


#18Robert Haas
robertmhaas@gmail.com
In reply to: Jim Nasby (#17)
Re: dynamic shared memory

On Fri, Sep 6, 2013 at 3:40 PM, Jim Nasby <jim@nasby.net> wrote:

The specific scenario I'm worried about is something like a PANIC in the
middle of the snprintf call in dsm_write_state_file(). That would leave that
file in a completely unknown state so who knows what would then happen on
restart. ISTM that writing a temp file and then doing a filesystem mv would
eliminate that issue.

Doing an atomic rename would eliminate the possibility of seeing a
partially written file, but a partially written file is mostly
harmless: we'll interpret whatever bytes we see as an integer and try
to use that as a DSM key. Then we'll just see that no such shared
memory key exists (probably) or that we don't own it (probably) or
that it doesn't look like a valid control segment (probably) and
ignore it.

If someone does a kill -9 on the postmaster in the middle of a write()
creating a partially written file, and the partially written file
happens to identify another shared memory segment owned by the same
user ID with the correct magic number and header contents to be
interpreted as a control segment, then we will indeed erroneously blow
away that purported control segment and all other segments to which it
points. I suppose we can stick in a rename() there just to completely
rule out that scenario, but it's pretty bloody unlikely anyway.
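The write-to-temp-then-rename() pattern under discussion, sketched with invented names (the real state file holds a shared memory handle; error handling here is minimal):

```c
#include <stdio.h>

/*
 * Write the state file under a temporary name, then rename() it into
 * place.  rename() is atomic on POSIX filesystems, so readers see either
 * the old file or the complete new one, never a torn write.  (As noted
 * above, fsync() before the rename buys little here: after an OS crash
 * the state file is not needed anyway.)
 */
static int
write_state_file_atomically(const char *path, const char *tmppath,
							unsigned long handle)
{
	FILE *f = fopen(tmppath, "w");

	if (f == NULL)
		return -1;
	if (fprintf(f, "%lu\n", handle) < 0)
	{
		fclose(f);
		return -1;
	}
	if (fclose(f) != 0)			/* flushes stdio buffers */
		return -1;
	if (rename(tmppath, path) != 0)
		return -1;
	return 0;
}
```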

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#19Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#18)
Re: dynamic shared memory

OK, here's v2 of this patch, by myself and Amit Kapila. We made the
following changes:

- Added support for Windows. This necessitated adding an impl_private
parameter to dsm_impl_op.
- Since we had impl_private anyway, I used it to implement shm
identifier caching for the System V implementation. I like how that
turned out better than the previous version; YMMV.
- Fixed typo noted by Jim Nasby.
- Removed preferred_address parameter, per griping from Andres and Noah.
- Removed use of posix_fallocate, per recent commits.
- Added use of rename() so that we won't ever see a partially-written
state file, per griping by Jim Nasby.
- Added an overflow check so that if a user of a 32-bit system asks
for 4.1GB of dynamic shared memory, they get an error instead of
getting .1GB of memory.
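The overflow check in the last bullet guards against a 64-bit byte count silently truncating when squeezed into a 32-bit size_t, which is how a 4.1GB request could quietly become a .1GB one. The essence, with an illustrative name:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * A 64-bit request size is acceptable only if it survives a round trip
 * through size_t.  On a 32-bit build, 4.1 GB cast to a 32-bit size_t
 * wraps to roughly 0.1 GB, so the comparison fails and we can error out
 * instead of allocating the wrong amount.
 */
static bool
request_size_is_valid(uint64_t request)
{
	return request == (uint64_t) (size_t) request;
}
```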

Despite Andres's comments, I did not remove the mmap implementation or
the GUC that allows users to select which implementation they care to
use. I still think those things are useful. While I appreciate that
there's a marginal cost in complexity to each new GUC, I also don't
think it pays to get too cheap. There's a difference between not
requiring users to configure things that they shouldn't have to
configure, and not letting them configure things they might want to
configure. Besides, having a GUC also provides a way of turning the
feature completely off, which seems justified, at least for now, on
the basis of the (ahem) limited number of users of this facility.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

dynshmem-v2.patch (application/octet-stream, +2575 −29)
#20Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#19)
Re: dynamic shared memory

Hi Robert, Hi Amit,

Ok, first read through the patch.

On 2013-09-13 15:32:36 -0400, Robert Haas wrote:

-AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
+AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])

Maybe also check for shm_unlink or is that too absurd?

--- /dev/null
+++ b/src/backend/storage/ipc/dsm.c
+#define PG_DYNSHMEM_STATE_FILE			PG_DYNSHMEM_DIR "/state"
+#define PG_DYNSHMEM_NEW_STATE_FILE		PG_DYNSHMEM_DIR "/state.new"

Hm, I guess you don't want to add it to global/ or so because of the
mmap implementation, where you presumably scan the directory?

+struct dsm_segment
+{
+	dlist_node	node;				/* List link in dsm_segment_list. */
+	ResourceOwner resowner;			/* Resource owner. */
+	dsm_handle	handle;				/* Segment name. */
+	uint32		control_slot;		/* Slot in control segment. */
+	void       *impl_private;		/* Implementation-specific private data. */
+	void	   *mapped_address;		/* Mapping address, or NULL if unmapped. */
+	uint64		mapped_size;		/* Size of our mapping. */
+};

Document that's backend local?

+typedef struct dsm_control_item
+{
+	dsm_handle	handle;
+	uint32		refcnt;				/* 2+ = active, 1 = moribund, 0 = gone */
+} dsm_control_item;
+
+typedef struct dsm_control_header
+{
+	uint32		magic;
+	uint32		nitems;
+	uint32		maxitems;
+	dsm_control_item	item[FLEXIBLE_ARRAY_MEMBER];
+} dsm_control_header;

And those are shared memory?

+static void dsm_cleanup_using_control_segment(void);
+static void dsm_cleanup_for_mmap(void);
+static bool dsm_read_state_file(dsm_handle *h);
+static void dsm_write_state_file(dsm_handle h);
+static void dsm_postmaster_shutdown(int code, Datum arg);
+static void dsm_backend_shutdown(int code, Datum arg);
+static dsm_segment *dsm_create_descriptor(void);
+static bool dsm_control_segment_sane(dsm_control_header *control,
+						 uint64 mapped_size);
+static uint64 dsm_control_bytes_needed(uint32 nitems);
+
+/* Has this backend initialized the dynamic shared memory system yet? */
+static bool dsm_init_done = false;
+
+/*
+ * List of dynamic shared memory segments used by this backend.
+ *
+ * At process exit time, we must decrement the reference count of each
+ * segment we have attached; this list makes it possible to find all such
+ * segments.
+ *
+ * This list should always be empty in the postmaster.  We could probably
+ * allow the postmaster to map dynamic shared memory segments before it
+ * begins to start child processes, provided that each process adjusted
+ * the reference counts for those segments in the control segment at
+ * startup time, but there's no obvious need for such a facility, which
+ * would also be complex to handle in the EXEC_BACKEND case.  Once the
+ * postmaster has begun spawning children, there's an additional problem:
+ * each new mapping would require an update to the control segment,
+ * which requires locking, in which the postmaster must not be involved.
+ */
+static dlist_head dsm_segment_list = DLIST_STATIC_INIT(dsm_segment_list);
+
+/*
+ * Control segment information.
+ *
+ * Unlike ordinary shared memory segments, the control segment is not
+ * reference counted; instead, it lasts for the postmaster's entire
+ * life cycle.  For simplicity, it doesn't have a dsm_segment object either.
+ */
+static dsm_handle dsm_control_handle;
+static dsm_control_header *dsm_control;
+static uint64 dsm_control_mapped_size = 0;
+static void	*dsm_control_impl_private = NULL;
+
+/*
+ * Start up the dynamic shared memory system.
+ *
+ * This is called just once during each cluster lifetime, at postmaster
+ * startup time.
+ */
+void
+dsm_postmaster_startup(void)
+{
+	void	   *dsm_control_address = NULL;
+	uint32		maxitems;
+	uint64		segsize;
+
+	Assert(!IsUnderPostmaster);
+
+	/* If dynamic shared memory is disabled, there's nothing to do. */
+	if (dynamic_shared_memory_type == DSM_IMPL_NONE)
+		return;
+
+	/*
+	 * Check for, and remove, shared memory segments left behind by a dead
+	 * postmaster.  This isn't necessary on Windows, which always removes them
+	 * when the last reference is gone.
+	 */
+	switch (dynamic_shared_memory_type)
+	{
+		case DSM_IMPL_POSIX:
+		case DSM_IMPL_SYSV:
+			dsm_cleanup_using_control_segment();
+			break;
+		case DSM_IMPL_MMAP:
+			dsm_cleanup_for_mmap();
+			break;
+		case DSM_IMPL_WINDOWS:
+			/* Nothing to do. */
+			break;
+		default:
+			elog(ERROR, "unknown dynamic shared memory type: %d",
+				 dynamic_shared_memory_type);
+	}
+
+	/* Determine size for new control segment. */
+	maxitems = PG_DYNSHMEM_FIXED_SLOTS
+		+ PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;

It seems likely that MaxConnections would be sufficient?

+	elog(DEBUG2, "dynamic shared memory system will support %lu segments",
+		(unsigned long) maxitems);
+	segsize = dsm_control_bytes_needed(maxitems);
+
+	/* Create new control segment. */
+	for (;;)
+	{
+		Assert(dsm_control_address == NULL);
+		Assert(dsm_control_mapped_size == 0);
+		dsm_control_handle = random();
+		if (dsm_impl_op(DSM_OP_CREATE, dsm_control_handle, segsize,
+						&dsm_control_impl_private, &dsm_control_address,
+						&dsm_control_mapped_size, ERROR))
+			break;
+	}

Comment that we loop endlessly to find an unused identifier.

Why do we create the control segment in dynamic smem and not in the
normal shmem? Presumably because this way it has the same lifetime? If
so, that should be commented upon.

+static void
+dsm_cleanup_using_control_segment(void)
+{
+	/*
+	 * We've managed to reattach it, but the contents might not be sane.
+	 * If they aren't, we disregard the segment after all.
+	 */
+	old_control = (dsm_control_header *) mapped_address;
+	if (!dsm_control_segment_sane(old_control, mapped_size))
+	{
+		dsm_impl_op(DSM_OP_DETACH, old_control_handle, 0, &impl_private,
+					&mapped_address, &mapped_size, LOG);
+		return;
+	}

So, we leave it just hanging around... Well, it has precedent in our
normal shared memory handling.

+static void
+dsm_cleanup_for_mmap(void)
+{

...

+}

I still maintain that the extra infrastructure required isn't worth the
gain of having the mmap implementation.

+/*
+ * Read and parse the state file.
+ *
+ * If the state file is empty or the contents are garbled, it probably means
+ * that the operating system rebooted before the data written by the previous
+ * postmaster made it to disk.  In that case, we can just ignore it; any shared
+ * memory from before the reboot should be gone anyway.
+ */
+static bool
+dsm_read_state_file(dsm_handle *h)
+{

...

+ return true;
+}

Perhaps CRC32 the content?

+/*
+ * Write our control segment handle to the state file, so that if the
+ * postmaster is killed without running it's on_shmem_exit hooks, the
+ * next postmaster can clean things up after restart.
+ */
+static void
+dsm_write_state_file(dsm_handle h)
+{
+	int			statefd;
+	char		statebuf[PG_DYNSHMEM_STATE_BUFSIZ];
+	int			nbytes;
+
+	/* Create or truncate the file. */
+	statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);

Doesn't this need a | PG_BINARY? Why are you using open() and not
BasicOpenFile or even OpenTransientFile?

+	/* Write contents. */
+	snprintf(statebuf, PG_DYNSHMEM_STATE_BUFSIZ, "%lu\n",
+			 (unsigned long) dsm_control_handle);

Why are we upcasting the length of dsm_control_handle here? Also,
doesn't this need the usual UINT64_FORMAT thingy?

+/*
+ * At shutdown time, we iterate over the control segment and remove all
+ * remaining dynamic shared memory segments.  We avoid throwing errors here;
+ * the postmaster is shutting down either way, and this is just non-critical
+ * resource cleanup.
+ */
+static void
+dsm_postmaster_shutdown(int code, Datum arg)
+{
+	uint32		nitems;
+	uint32		i;
+	void	   *dsm_control_address;
+	void	   *junk_mapped_address = NULL;
+	void	   *junk_impl_private = NULL;
+	uint64		junk_mapped_size = 0;
+
+	/* If dynamic shared memory is disabled, there's nothing to do. */
+	if (dynamic_shared_memory_type == DSM_IMPL_NONE)
+		return;

But we don't even get called in that case, right? You're only registering
the on_shmem_exit() handler after the same check in startup?

+	/*
+	 * If some other backend exited uncleanly, it might have corrupted the
+	 * control segment while it was dying.  In that case, we warn and ignore
+	 * the contents of the control segment.  This may end up leaving behind
+	 * stray shared memory segments, but there's not much we can do about
+	 * that if the metadata is gone.
+	 */
+	nitems = dsm_control->nitems;
+	if (!dsm_control_segment_sane(dsm_control, dsm_control_mapped_size))
+	{
+		ereport(LOG,
+				(errmsg("dynamic shared memory control segment is corrupt")));
+		return;
+	}

I'd rename dsm_control_segment_sane to dsm_control_segment_looks_sane ;)

+	/* Remove any remaining segments. */
+	for (i = 0; i < nitems; ++i)
+	{
+		dsm_handle	handle;
+
+		/* If the reference count is 0, the slot is actually unused. */
+		if (dsm_control->item[i].refcnt == 0)
+			continue;
+
+		/* Log debugging information. */
+		handle = dsm_control->item[i].handle;
+		elog(DEBUG2, "cleaning up orphaned dynamic shared memory with ID %lu",
+			(unsigned long) handle);

I'd include the refcount here, might be helpful for debugging.

+		/* Destroy the segment. */
+		dsm_impl_op(DSM_OP_DESTROY, handle, 0, &junk_impl_private,
+					&junk_mapped_address, &junk_mapped_size, LOG);
+	}
+
+	/* Remove the control segment itself. */
+	elog(DEBUG2,
+		 "cleaning up dynamic shared memory control segment with ID %lu",
+		 (unsigned long) dsm_control_handle);
+	dsm_control_address = dsm_control;
+	dsm_impl_op(DSM_OP_DESTROY, dsm_control_handle, 0,
+				&dsm_control_impl_private, &dsm_control_address,
+				&dsm_control_mapped_size, LOG);
+	dsm_control = dsm_control_address;
+
+	/* And, finally, remove the state file. */
+	if (unlink(PG_DYNSHMEM_STATE_FILE) < 0)
+		ereport(LOG,
+				(errcode_for_file_access(),
+				 errmsg("could not unlink file \"%s\": %m",
+					PG_DYNSHMEM_STATE_FILE)));
+}

Not sure whether it's sensible to only LOG in these cases. After all
there's something unexpected happening. The robustness argument doesn't
count since we're already shutting down.

+/*
+ * Prepare this backend for dynamic shared memory usage.  Under EXEC_BACKEND,
+ * we must reread the state file and map the control segment; in other cases,
+ * we'll have inherited the postmaster's mapping and global variables.
+ */
+static void
+dsm_backend_startup(void)
+{
+
+#ifdef EXEC_BACKEND
+	{
+		dsm_handle	control_handle;
+		void	   *control_address = NULL;
+
+		/* Read the control segment information from the state file. */
+		if (!dsm_read_state_file(&control_handle))
+			ereport(ERROR,
+					(errcode(ERRCODE_INTERNAL_ERROR),
+					 errmsg("could not parse dynamic shared memory state file")));
+
+		/* Attach control segment. */
+		dsm_impl_op(DSM_OP_ATTACH, control_handle, 0,
+					&dsm_control_impl_private, &control_address,
+					&dsm_control_mapped_size, ERROR);
+		dsm_control_handle = control_handle;
+		dsm_control = control_address;
+		/* If control segment doesn't look sane, something is badly wrong. */
+		if (!dsm_control_segment_sane(dsm_control, dsm_control_mapped_size))
+		{
+			dsm_impl_op(DSM_OP_DETACH, control_handle, 0,
+					    &dsm_control_impl_private, &control_address,
+						&dsm_control_mapped_size, WARNING);
+			ereport(ERROR,
+					(errcode(ERRCODE_INTERNAL_ERROR),
+					 errmsg("dynamic shared memory control segment is not valid")));
+		}

Imo that's a PANIC or at the very least a FATAL.

+
+/*
+ * Create a new dynamic shared memory segment.
+ */
+dsm_segment *
+dsm_create(uint64 size)
+{

...

+	/* Verify that we can support an additional mapping. */
+	if (nitems >= dsm_control->maxitems)
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+				 errmsg("too many dynamic shared memory segments")));

Do we rely on being run in an environment with proper setup for lwlock
cleanup? I can imagine shared libraries doing this pretty early on...

+dsm_segment *
+dsm_attach(dsm_handle h)
+{
+	/* Bump reference count for this segment in shared memory. */
+	LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
+	nitems = dsm_control->nitems;
+	for (i = 0; i < nitems; ++i)
+	{
+		/* If the reference count is 0, the slot is actually unused. */
+		if (dsm_control->item[i].refcnt == 0)
+			continue;
+
+		/*
+		 * If the reference count is 1, the slot is still in use, but the
+		 * segment is in the process of going away.  Treat that as if we
+		 * didn't find a match.
+		 */
+		if (dsm_control->item[i].refcnt == 1)
+			break;

Why can we see that state? Shouldn't locking prevent that?

+/*
+ * Resize an existing shared memory segment.
+ *
+ * This may cause the shared memory segment to be remapped at a different
+ * address.  For the caller's convenience, we return the mapped address.
+ */
+void *
+dsm_resize(dsm_segment *seg, uint64 size)
+{
+	Assert(seg->control_slot != INVALID_CONTROL_SLOT);
+	dsm_impl_op(DSM_OP_RESIZE, seg->handle, size, &seg->impl_private,
+				&seg->mapped_address, &seg->mapped_size, ERROR);
+	return seg->mapped_address;
+}

Hm. That's valid when there are other backends attached? What are the
implications for already attached ones?

Shouldn't we error out if !dsm_impl_can_resize()?

+/*
+ * Detach from a shared memory segment, destroying the segment if we
+ * remove the last reference.
+ *
+ * This function should never fail.  It will often be invoked when aborting
+ * a transaction, and a further error won't serve any purpose.  It's not a
+ * complete disaster if we fail to unmap or destroy the segment; it means a
+ * resource leak, but that doesn't necessarily preclude further operations.
+ */
+void
+dsm_detach(dsm_segment *seg)
+{

Why do we want to ignore errors like failing to unmap? ISTM that
indicates an actual problem...

+	/* Reduce reference count, if we previously increased it. */
+	if (seg->control_slot != INVALID_CONTROL_SLOT)
+	{
+		uint32	refcnt;
+		uint32	control_slot = seg->control_slot;
+
+		LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
+		Assert(dsm_control->item[control_slot].handle == seg->handle);
+		Assert(dsm_control->item[control_slot].refcnt > 1);
+		refcnt = --dsm_control->item[control_slot].refcnt;
+		seg->control_slot = INVALID_CONTROL_SLOT;
+		LWLockRelease(DynamicSharedMemoryControlLock);
+
+		/* If new reference count is 1, try to destroy the segment. */
+		if (refcnt == 1)
+		{
+			/*
+			 * If we fail to destroy the segment here, or are killed before
+			 * we finish doing so, the reference count will remain at 1, which
+			 * will mean that nobody else can attach to the segment.  At
+			 * postmaster shutdown time, or when a new postmaster is started
+			 * after a hard kill, another attempt will be made to remove the
+			 * segment.
+			 *
+			 * The main case we're worried about here is being killed by
+			 * a signal before we can finish removing the segment.  In that
+			 * case, it's important to be sure that the segment still gets
+			 * removed. If we actually fail to remove the segment for some
+			 * other reason, the postmaster may not have any better luck than
+			 * we did.  There's not much we can do about that, though.
+			 */
+			if (dsm_impl_op(DSM_OP_DESTROY, seg->handle, 0, &seg->impl_private,
+							&seg->mapped_address, &seg->mapped_size, WARNING))
+			{
+				LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
+				Assert(dsm_control->item[control_slot].handle == seg->handle);
+				Assert(dsm_control->item[control_slot].refcnt == 1);
+				dsm_control->item[control_slot].refcnt = 0;
+				LWLockRelease(DynamicSharedMemoryControlLock);
+			}
+		}
+	}

Yuck. So that's the answer to my earlier question about the legality of
seeing a refcount of 1....

diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
new file mode 100644
index 0000000..8005fd9
@@ -0,0 +1,986 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsm_impl.c
+ *	  manage dynamic shared memory segments
+ *
+ * This file provides low-level APIs for creating and destroying shared
+ * memory segments using several different possible techniques.  We refer
+ * to these segments as dynamic because they can be created, altered, and
+ * destroyed at any point during the server life cycle.  This is unlike
+ * the main shared memory segment, of which there is always exactly one
+ * and which is always mapped at a fixed address in every PostgreSQL
+ * background process.
+ *
+ * Because not all systems provide the same primitives in this area, nor
+ * do all primitives behave the saem way on all systems, we provide

*same

+ * several implementations of this facility.  Many systems implement
+ * POSIX shared memory (shm_open etc.), which is well-suited to our needs
+ * in this area, with the exception that shared memory identifiers live
+ * in a flat system-wide namespace, raising the uncomfortable prospect of
+ * name collisions with other processes (including other copies of
+ * PostgreSQL) running on the same system.

Why isn't the port number part of the posix shmem identifiers? Sure, we
retry, but using a logic similar to sysv_shmem.c seems like a good idea.

+/*
+ * Does the current dynamic shared memory implementation support resizing
+ * segments?  (The answer here could be platform-dependent in the future,
+ * since AIX allows shmctl(shmid, SHM_RESIZE, &buffer), though you apparently
+ * can't resize segments to anything larger than 256MB that way.  For now,
+ * we keep it simple.)
+ */
+bool
+dsm_impl_can_resize(void)
+{
+	switch (dynamic_shared_memory_type)
+	{
+		case DSM_IMPL_NONE:
+			return false;
+		case DSM_IMPL_SYSV:
+			return false;
+		case DSM_IMPL_WINDOWS:
+			return false;
+		default:
+			return true;
+	}
+}

Looks to me like the logic should be the reverse.

+#ifdef USE_DSM_POSIX
+/*
+ * Operating system primitives to support POSIX shared memory.
+ *
+ * POSIX shared memory segments are created and attached using shm_open()
+ * and shm_unlink(); other operations, such as sizing or mapping the
+ * segment, are performed as if the shared memory segments were files.
+ *
+ * Indeed, on some platforms, they may be implemented that way.  While
+ * POSIX shared memory segments seem intended to exist in a flat namespace,
+ * some operating systems may implement them as files, even going so far
+ * to treat a request for /xyz as a request to create a file by that name
+ * in the root directory.  Users of such broken platforms should select
+ * a different shared memory implementation.
+ */
+static bool
+dsm_impl_posix(dsm_op op, dsm_handle handle, uint64 request_size,
+			   void **impl_private, void **mapped_address, uint64 *mapped_size,
+			   int elevel)
+{
+	char	name[64];
+	int		flags;
+	int		fd;
+	char   *address;
+
+	snprintf(name, 64, "/PostgreSQL.%lu", (unsigned long) handle);

Why wider than the handle?

+	/*
+	 * If we're reattaching or resizing, we must remove any existing mapping,
+	 * unless we've already got the right thing mapped.
+	 */
+	if (*mapped_address != NULL)
+	{
+		if (*mapped_size == request_size)
+			return true;

Hm. It could have gotten resized to the old size, or resized twice. In
that case it might not be at the same address as before, so checking
the size doesn't seem to be sufficient.

+static bool
+dsm_impl_sysv(dsm_op op, dsm_handle handle, uint64 request_size,
+			  void **impl_private, void **mapped_address, uint64 *mapped_size,
+			  int elevel)
+{
+	/*
+	 * There's one special key, IPC_PRIVATE, which can't be used.  If we end
+	 * up with that value by chance during a create operation, just pretend
+	 * it already exists, so that caller will retry.  If we run into it
+	 * anywhere else, the caller has passed a handle that doesn't correspond
+	 * to anything we ever created, which should not happen.
+	 */
+	if (key == IPC_PRIVATE)
+	{
+		if (op != DSM_OP_CREATE)
+			elog(elevel, "System V shared memory key may not be IPC_PRIVATE");
+		errno = EEXIST;
+		return false;
+	}

Hm. You elog() at elevel here, but the retry code in dsm_create() passes
in ERROR?

+static bool
+dsm_impl_mmap(dsm_op op, dsm_handle handle, uint64 request_size,
+			  void **impl_private, void **mapped_address, uint64 *mapped_size,
+			  int elevel)
+{
+	/*
+	 * If we're reattaching or resizing, we must remove any existing mapping,
+	 * unless we've already got the right thing mapped.
+	 */
+	if (*mapped_address != NULL)
+	{
+		if (*mapped_size == request_size)
+			return true;

Same thing as in posix shmem.

+static int
+errcode_for_dynamic_shared_memory()
+{
+	if (errno == EFBIG || errno == ENOMEM)
+		return errcode(ERRCODE_OUT_OF_MEMORY);
+	else
+		return errcode_for_file_access();
+}

Is EFBIG guaranteed to be defined?

+/*
+ * Forget that a temporary file is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetDSM(ResourceOwner owner, dsm_segment *seg)
+{
+	dsm_segment **dsms = owner->dsms;
+	int			ns1 = owner->ndsms - 1;
+	int			i;
+
+	for (i = ns1; i >= 0; i--)
+	{
+		if (dsms[i] == seg)
+		{
+			while (i < ns1)
+			{
+				dsms[i] = dsms[i + 1];
+				i++;
+			}
+			owner->ndsms = ns1;
+			return;
+		}
+	}
+	elog(ERROR,
+		 "dynamic shared memory segment %lu is not owned by resource owner %s",
+		 (unsigned long) dsm_segment_handle(seg), owner->name);
+}

Not really an issue, but this will grow owner->dsm unnecessarily because
ResourceOwnerEnlargeDSMs() will have been done previously.

Not sure yet how happy I am with the separation of concerns between
dsm.c and dsm_impl.c...

That's it for now.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#20)
#22Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#21)
#23Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#22)
#24Noah Misch
noah@leadboat.com
In reply to: Amit Kapila (#23)
#25Amit Kapila
amit.kapila16@gmail.com
In reply to: Noah Misch (#24)
#26Noah Misch
noah@leadboat.com
In reply to: Amit Kapila (#25)
#27Amit Kapila
amit.kapila16@gmail.com
In reply to: Noah Misch (#26)
#28Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#22)
#29Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#28)
#30Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#28)
#31Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#28)
#32Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Noah Misch (#31)
#33Robert Haas
robertmhaas@gmail.com
In reply to: Jim Nasby (#32)
#34Noah Misch
noah@leadboat.com
In reply to: Amit Kapila (#27)
#35Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#31)
#36Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#35)
#37Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#35)
#38Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#35)
#39Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#38)
#40Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#39)
#41Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#40)