BUG #14206: Switch to using POSIX semaphores on FreeBSD
The following bug has been logged on the website:
Bug reference: 14206
Logged by: Maksym Sobolyev
Email address: sobomax@freebsd.org
PostgreSQL version: 9.5.2
Operating system: FreeBSD 10.3-RELEASE amd64
Description:
Traditionally, SYSV semaphores are used for synchronization on FreeBSD.
However, according to the analysis done by Konstantin Belousov here
https://www.kib.kiev.ua/kib/pgsql_perf_v2.0.pdf there is at the very least
some performance benefit in using POSIX semaphores instead of SYSV
semaphores when PG runs on a FreeBSD host.
In addition to that performance benefit, the SYSV primitives are usually
a very limited resource by default, so in order to run any significant
number of connections on your DB server you need to tweak kernel
options to increase the number of those. And last but not least, SYSV
primitives, once allocated, need explicit removal, which might not be
performed when a PG process dies or is SIGKILLed. None of those is an issue
with POSIX semaphores.
We've been testing this patch on the 9.1, 9.2 and 9.5 versions of PG for a
few weeks now and it performs at least as well as the old SYSV builds. We
also see the number of semaphores in use drop to 0 in the ipcs(1) output, so
the patch actually does what it's supposed to do.
--- src/template/freebsd
+++ src/template/freebsd
@@ -3,3 +3,4 @@
 case $host_cpu in
   alpha*) CFLAGS="-O";; # alpha has problems with -O2
 esac
+USE_NAMED_POSIX_SEMAPHORES=1
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
sobomax@freebsd.org writes:
However, according to the analysis done by Konstantin Belousov here
https://www.kib.kiev.ua/kib/pgsql_perf_v2.0.pdf there is at the very least
some performance benefit in using POSIX semaphores instead of SYSV
semaphores when PG runs on a FreeBSD host.
I wonder how thorough that performance testing was. The reason that the
named-POSIX-semaphore code exists is that it used to be the only kind of
semaphore available on ancient OS X versions. But we got rid of that as
soon as we could, for the reason explained in template/darwin:
# Select appropriate semaphore support. Darwin 6.0 (Mac OS X 10.2) and up
# support System V semaphores; before that we have to use POSIX semaphores,
# which are less good for our purposes because they eat a file descriptor
# per backend per max_connection slot.
The extra FDs slow down launching of new backends (due to having to dup
all the postmaster's FDs for the semaphores) and if max_connections is
large they can take a pretty serious chunk out of your system-wide file
table, at worst max_connections squared.
Now maybe FreeBSD is different enough from OS X that these are not problems
for you, but I'm dubious.
Have you got unnamed POSIX semaphores, and if so have you tried that
variant?
regards, tom lane
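
The descriptor cost described above can be put into rough numbers. This is a minimal sketch, not PostgreSQL code: the function name `worst_case_semaphore_fds` and its parameters are hypothetical, and it assumes exactly one open descriptor per backend per max_connections slot, as the template/darwin comment states. It ignores any semaphores beyond the max_connections slots.

```python
def worst_case_semaphore_fds(max_connections, active_backends=None):
    """Estimate file-table entries consumed by semaphore descriptors.

    Assumes one open descriptor per backend per max_connections slot,
    the cost model quoted from template/darwin. Hypothetical helper,
    for illustration only.
    """
    if max_connections < 0:
        raise ValueError("max_connections must be non-negative")
    if active_backends is None:
        # Worst case: every connection slot is occupied by a backend.
        active_backends = max_connections
    if not 0 <= active_backends <= max_connections:
        raise ValueError("active_backends must be between 0 and max_connections")
    descriptors_per_backend = max_connections
    return active_backends * descriptors_per_backend


if __name__ == "__main__":
    for n in (100, 500, 1000):
        print(n, worst_case_semaphore_fds(n))
```

Under this model, max_connections = 1000 with every slot in use comes to a million file-table entries, which is the "max_connections squared" worst case.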
Tom, thanks for looking at it so promptly. I am adding kib@ into the
discussion. Perhaps he would comment on the SYSV vs. POSIX in FreeBSD and
named vs. unnamed.
As far as I can tell, the sem_init(3) interface is present in FreeBSD
10.3, so maybe we can use those instead?
-Max
Maxim Sobolev <sobomax@freebsd.org> writes:
Tom, thanks for looking at it so promptly. I am adding kib@ into the
discussion. Perhaps he would comment on the SYSV vs. POSIX in FreeBSD and
named vs. unnamed.
BTW, I trawled our archives and found this thread concerning the switch
from POSIX to SYSV on OS X:
/messages/by-id/3830CBEB-F8CE-4EBC-BE16-A415E78A4CBC@apple.com
I'm not sure what you were using to decide that POSIX semaphores were
okay, but the points in that thread about pgbench not being a very
good test case remain relevant.
As far as I can tell, the sem_init(3) interface is present in FreeBSD
10.3, so maybe we can use those instead?
If that seems like a competitive alternative for you, it'd be nice to have
a platform where we use unnamed POSIX semaphores by default. I'm a little
worried about whether that code has suffered bit-rot, since it's been
sitting there basically unused for so long.
regards, tom lane
On Tue, Jun 21, 2016 at 04:36:02PM -0400, Tom Lane wrote:
Maxim Sobolev <sobomax@freebsd.org> writes:
Tom, thanks for looking at it so promptly. I am adding kib@ into the
discussion. Perhaps he would comment on the SYSV vs. POSIX in FreeBSD and
named vs. unnamed.
BTW, I trawled our archives and found this thread concerning the switch
from POSIX to SYSV on OS X:
/messages/by-id/3830CBEB-F8CE-4EBC-BE16-A415E78A4CBC@apple.com
I'm not sure what you were using to decide that POSIX semaphores were
okay, but the points in that thread about pgbench not being a very
good test case remain relevant.
As far as I can tell, the sem_init(3) interface is present in FreeBSD
10.3, so maybe we can use those instead?
If that seems like a competitive alternative for you, it'd be nice to have
a platform where we use unnamed POSIX semaphores by default. I'm a little
worried about whether that code has suffered bit-rot, since it's been
sitting there basically unused for so long.
On FreeBSD, there is no practical difference in resource consumption
between named and unnamed semaphores. I mean that after the sem_open(3)
call, an open file descriptor is not kept in the process fd table. The
semaphore is represented by an mmapped page; libc and the kernel operate
solely on the page content and use umtx(2) to implement a counting semaphore.
In other words, no, there is no additional overhead when starting a
connection with either named or unnamed (sem_init(3)) POSIX
semaphores on FreeBSD, and there is no open-file overhead.
That said, the problem with the SysV semaphores is that the API allows
operations on arbitrary sets of semaphores. Unless some unusual
and complex measures are taken, the implementation has to use a global
internal lock to synchronize semop(2). This is what I noted in the
paper.
Konstantin, would it be too much to ask you to run your tests with
unnamed semaphores instead for your performance work? As far as I
understand what Tom said, the named code was kind of a one-off workaround
for a specific ancient version of Darwin, not used by any other platform
that PG cares about, so it might rot and/or get nuked eventually. Therefore,
we might have a better chance of getting our changes accepted into
PostgreSQL if we use the unnamed option. And they are not using the "named"
part for anything functionally important anyway, so the unnamed POSIX
semaphore is naturally the best primitive to use. This might also stir some
interest among other OSes in switching to that. Thanks!
-Max
Konstantin Belousov <kostikbel@gmail.com> writes:
On Tue, Jun 21, 2016 at 04:36:02PM -0400, Tom Lane wrote:
If that seems like a competitive alternative for you, it'd be nice to have
a platform where we use unnamed POSIX semaphores by default. I'm a little
worried about whether that code has suffered bit-rot, since it's been
sitting there basically unused for so long.
On FreeBSD, there is no practical difference in resource consumption
between named and unnamed semaphores. I mean that after the sem_open(3)
call, an open file descriptor is not kept in the process fd table. The
semaphore is represented by an mmapped page; libc and the kernel operate
solely on the page content and use umtx(2) to implement a counting semaphore.
Is there any kernel-side resource at all? The thing that concerns me
about the POSIX APIs is that it's not very clear whether anything gets
left behind if the database crashes. The Linux man page for sem_destroy
says
An unnamed semaphore should be destroyed with sem_destroy() before the
memory in which it is located is deallocated. Failure to do this can
result in resource leaks on some implementations.
and while they don't say that their own implementation has such a problem,
it's worrisome. We go to some lengths to ensure that we can recycle SysV
semaphores after a crash, but there's no equivalent logic in the POSIX
semaphore code, and I don't see how it would even be possible to identify
leftover "unnamed" semaphores.
That said, the problem with the SysV semaphores is that the API allows
operations on arbitrary sets of semaphores. Unless some unusual
and complex measures are taken, the implementation has to use a global
internal lock to synchronize semop(2). This is what I noted in the
paper.
It's certainly true that semop(2) is more complicated than we need.
But in practice, we only call semop(2) when we need to sleep, or to
awaken a sleeping process, so I'm not sure that performance of it
matters a lot to us.
regards, tom lane
Tom, on a related note about the merits of SYSV semaphores vs. POSIX: the
handling of a SYSV semaphore shortage in PG is, ugh, awful. The whole server
crashes (abort()s); bumping into that on a production box is not fun at all,
and it's particularly easy to hit since the resource is severely constrained
by default. Ideally it should just deny the particular connection request. I
don't know if it's also true for POSIX primitives, but at least those are
by design more abundant. We experienced this with a fairly recent PG 9.1;
we've been dealing with some of those crashes just last month. Maybe it's
something you guys need to consider for improvement, if you have not
already.
-Max
On Wed, Jun 22, 2016 at 10:48:50AM -0400, Tom Lane wrote:
Konstantin Belousov <kostikbel@gmail.com> writes:
On Tue, Jun 21, 2016 at 04:36:02PM -0400, Tom Lane wrote:
If that seems like a competitive alternative for you, it'd be nice to have
a platform where we use unnamed POSIX semaphores by default. I'm a little
worried about whether that code has suffered bit-rot, since it's been
sitting there basically unused for so long.
On FreeBSD, there is no practical difference in resource consumption
between named and unnamed semaphores. I mean that after the sem_open(3)
call, an open file descriptor is not kept in the process fd table. The
semaphore is represented by an mmapped page; libc and the kernel operate
solely on the page content and use umtx(2) to implement a counting semaphore.
Is there any kernel-side resource at all? The thing that concerns me
about the POSIX APIs is that it's not very clear whether anything gets
left behind if the database crashes. The Linux man page for sem_destroy
says
An unnamed semaphore should be destroyed with sem_destroy() before the
memory in which it is located is deallocated. Failure to do this can
result in resource leaks on some implementations.
and while they don't say that their own implementation has such a problem,
it's worrisome. We go to some lengths to ensure that we can recycle SysV
semaphores after a crash, but there's no equivalent logic in the POSIX
semaphore code, and I don't see how it would even be possible to identify
leftover "unnamed" semaphores.
On FreeBSD, it is only a memory page which is mmapped into all
processes that consume the unnamed semaphore. Of course, if a process
is blocked on the semaphore, there is some bookkeeping done in the kernel so
that a post finds all waiters. But it is lightweight and automatically
released on wakeup. In other words, there is nothing to worry about
WRT cleanup after unnamed semaphore consumers are killed. The same holds for
named semaphores, but there the file is left around.
That said, the problem with the SysV semaphores is that the API allows
operations on arbitrary sets of semaphores. Unless some unusual
and complex measures are taken, the implementation has to use a global
internal lock to synchronize semop(2). This is what I noted in the
paper.
It's certainly true that semop(2) is more complicated than we need.
But in practice, we only call semop(2) when we need to sleep, or to
awaken a sleeping process, so I'm not sure that performance of it
matters a lot to us.
The issue is that the sleeps and wakeups on SysV semaphores do not scale,
at least on FreeBSD.
Maxim Sobolev <sobomax@freebsd.org> writes:
Tom, on a related note about the merits of SYSV semaphores vs. POSIX: the
handling of a SYSV semaphore shortage in PG is, ugh, awful. The whole server
crashes (abort()s); bumping into that on a production box is not fun at all,
and it's particularly easy to hit since the resource is severely constrained
by default. Ideally it should just deny the particular connection request.
This seems like nonsense, because those are acquired once at postmaster
startup, not per connection. You will need to decrease max_connections
to start successfully in a resource-constrained system, but the same is
true of other resource limits.
regards, tom lane
Tom, my diagnosis may be nonsense, but the crash is real.
Jun 5 21:47:38 sippy postgres[3744]: [2-1] PANIC: semop(id=65608) failed: Invalid argument
Jun 5 21:47:38 sippy postgres[3743]: [2-1] PANIC: semop(id=65608) failed: Invalid argument
Jun 5 21:47:39 sippy postgres[3725]: [2-1] PANIC: semop(id=65609) failed: Invalid argument
Jun 5 21:47:39 sippy postgres[3742]: [2-1] PANIC: semop(id=65609) failed: Invalid argument
Jun 5 21:47:42 sippy postgres[3550]: [2-1] PANIC: semop(id=65611) failed: Invalid argument
Jun 5 21:47:42 sippy postgres[3664]: [2-1] PANIC: semop(id=65609) failed: Invalid argument
Jun 5 21:47:42 sippy postgres[3667]: [2-1] PANIC: semop(id=65609) failed: Invalid argument
Jun 5 21:47:42 sippy postgres[3663]: [2-1] PANIC: semop(id=65609) failed: Invalid argument
Jun 5 21:47:42 sippy postgres[3666]: [2-1] PANIC: semop(id=65609) failed: Invalid argument
Jun 5 21:47:42 sippy postgres[3665]: [2-1] PANIC: semop(id=65609) failed: Invalid argument
Jun 5 22:30:56 sippy postgres[3632]: [2-1] PANIC: semop(id=65610) failed: Invalid argument
Jun 5 22:30:56 sippy postgres[3633]: [2-1] PANIC: semop(id=65610) failed: Invalid argument
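
To see how the failures are spread across semaphore sets, the excerpt above can be tallied by id. This is a minimal sketch that assumes only the `PANIC: semop(id=N) failed` message format shown in the log; the function name `count_semop_panics` and the two-entry sample are for illustration only.

```python
import re
from collections import Counter

# Matches the PANIC message format seen in the log excerpt.
SEMOP_PANIC = re.compile(r"PANIC:\s+semop\(id=(\d+)\) failed")


def count_semop_panics(log_text):
    """Return a Counter mapping semaphore set id to number of PANICs."""
    return Counter(int(m.group(1)) for m in SEMOP_PANIC.finditer(log_text))


if __name__ == "__main__":
    sample = (
        "Jun 5 21:47:38 sippy postgres[3744]: [2-1] "
        "PANIC: semop(id=65608) failed: Invalid argument\n"
        "Jun 5 21:47:39 sippy postgres[3725]: [2-1] "
        "PANIC: semop(id=65609) failed: Invalid argument\n"
    )
    for sem_id, hits in sorted(count_semop_panics(sample).items()):
        print(sem_id, hits)
```

Run over the full excerpt, this reports four distinct semaphore set ids (65608 through 65611), so the failures are not confined to a single set.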
From the semop(2) man page:
[EINVAL] No semaphore set corresponds to semid, or the process
would exceed the system-defined limit for the number
of per-process SEM_UNDO structures.
AFAIK we've been hitting the second cause there. This is with FreeBSD 10.3
and postgresql 9.2.16 specifically. We've also seen this with 9.1 I think,
but I am not 100% sure. The specific limit that got exceeded
was kern.ipc.semmnu.
I can probably dig up some stack traces, although they might have been
cleared out by now. Now looking at it again, I think you are probably right
that this is not happening at primitive creation time, but during actual
use. Which may make graceful handling tricky, if possible at all; still, it
would be "good to have" from my PG user's point of view.
Nevertheless, it just makes using POSIX primitives even more attractive
IMHO.
--
Maksym Sobolyev
Sippy Software, Inc.
Internet Telephony (VoIP) Experts
Tel (Canada): +1-778-783-0474
Tel (Toll-Free): +1-855-747-7779
Fax: +1-866-857-6942
Web: http://www.sippysoft.com
MSN: sales@sippysoft.com
Skype: SippySoft
Maxim Sobolev <sobomax@sippysoft.com> writes:
From the semop(2) man page:
[EINVAL] No semaphore set corresponds to semid, or the process
would exceed the system-defined limit for the number
of per-process SEM_UNDO structures.
AFAIK we've been hitting the second cause there. This is with FreeBSD 10.3
and postgresql 9.2.16 specifically. We've also seen this with 9.1 I think,
but I am not 100% sure. The specific limit that got exceeded
was kern.ipc.semmnu.
We never ask semop(2) for SEM_UNDO, so are you sure this isn't a kernel
bug? I've never heard of such a report on any other platform.
regards, tom lane
Tom,
Well, now thinking about it, no, I am certainly not sure about the root
cause. Mind you, until recently I had almost zero knowledge about the use of
those by postgres and, thank God, I never had to use SYSV IPC for anything
we develop here. We've been increasing those limits to reasonably high
values for many years on the kernels that we ship just to make PG run
happily. And so, except for the sporadic leakage of leftover semaphores,
which we've worked around with ipcrm, it did not bother us much. (In case
it's something you want to look into, that leakage seems to happen when
postgres crashes upon a write failure when disk space runs out; the last
time I saw it happen was this spring, so it must affect some of the recent
revisions too.)
Back to those resource shortage crashes: in fact, now that you are asking,
it's certainly possible that something went south in the kernel. We also
did an OS upgrade from 10.1 to 10.3 at the same time, along with a
minor-version refresh of postgres, when those crashes started to happen. But
you always blame the software part first, so we did, and in our case bumping
kern.ipc.semmnu did solve it for us, so we closed the case and moved on. I
might look through the changes in semop() between 10.1 and 10.3 to see where
that EINVAL might be coming from. However, since we are considering
switching to (un)named POSIX primitives, I don't feel a strong urge to do so.
But again, on a general note, this kind of underlines the fact that SYSV IPC
might be seen as a somewhat legacy interface by the FreeBSD kernel people.
(Disclaimer: this is all IMHO, I'm not talking on behalf of the FreeBSD
project or any part of it.)
Some of it is related to the API design flaws that Konstantin alluded to,
some of it to the fact that, as far as I understand, it's a separate big
chunk of kernel code created decades ago by people who might no longer be
actively involved with the project, and designed to run in a vastly
different hardware and software environment. Nobody really owns it and it's
not easy to regression test. In my own limited experience, if not for
postgresql, we would not even have SYSV IPC enabled in our production
kernels. And we use some 300+ other open-source packages in our product, so
the sample is quite representative, I think.
On the other hand, as far as I understood from Konstantin's explanations,
POSIX primitives share most of their code with the pthread library, and as
such it's mostly modern code, well maintained, production and regression
tested, and continuously optimized to run on modern systems.
The point that I am trying to make here is that perhaps postgres developers
need to acknowledge that clinging to SYSV IPC these days as the only
supported choice for synchronization is like using, say, sbrk(2) or mmap(2)
to manage the heap instead of malloc(3). Yes, I understand that the "don't
touch it if it's not broken" principle is important and yes, there might be
some valid cases where you'd want to use sbrk() or mmap() too, but I also
don't see any technical reasons for not making POSIX primitives a
first-class citizen in PG either.
-Max
On Wed, Jun 22, 2016 at 11:49:33AM -0700, Maxim Sobolev wrote:
Some of it is related to the API design flaws that Konstantin alluded to,
some of it to the fact that, as far as I understand, it's a separate big
chunk of kernel code created decades ago by people who might no longer be
actively involved with the project, and designed to run in a vastly
different hardware and software environment. Nobody really owns it and it's
not easy to regression test. In my own limited experience, if not for
postgresql, we would not even have SYSV IPC enabled in our production
kernels. And we use some 300+ other open-source packages in our product, so
the sample is quite representative, I think.
FYI, databases were the primary users of SYSV IPC even back in the old
days, so I am not surprised we might still be one of the rare users. :-)
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +