problems on Solaris

Started by Andrew Dunstan, over 10 years ago, 15 messages

andrew@dunslane.net

over 10 years ago

Buildfarm members casteroides and protosciurus have been having some
problems that seem puzzling. These animals both run on the same machine,
but with different compilers.

casteroides runs with the Sun Studio 12 compiler, and has twice in the
last 3 days demonstrated this error:

[5561ce0c.51b7:25] LOG: starting background worker process "test_shm_mq"
[5561ce1e.5287:9] PANIC: stuck spinlock (100cb77f4) detected at atomics.c:30
[5561ce1e.5287:10] STATEMENT: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
[5561ce0c.51b7:26] LOG: server process (PID 21127) was terminated by signal 6
[5561ce0c.51b7:27] DETAIL: Failed process was running: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
[5561ce0c.51b7:28] LOG: terminating any other active server processes

It's not constant - between the two failures was a success.

protosciurus runs with gcc 3.4.3 and gets this error:

gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument -g -I/usr/local/include -m64 -I. -I../../../src/interfaces/libpq -I./../regress -I../../../src/include -c -o specparse.o specparse.c
In file included from /usr/include/sys/vnode.h:47,
from /usr/include/sys/stream.h:22,
from /usr/include/netinet/in.h:66,
from /usr/include/netdb.h:98,
from ../../../src/include/port.h:17,
from ../../../src/include/c.h:1114,
from ../../../src/include/postgres_fe.h:25,
from specparse.y:13:
/usr/include/sys/kstat.h:439: error: syntax error before numeric constant
/usr/include/sys/kstat.h:463: error: syntax error before '}' token
/usr/include/sys/kstat.h:464: error: syntax error before '}' token
In file included from /usr/include/sys/stream.h:22,
from /usr/include/netinet/in.h:66,
from /usr/include/netdb.h:98,
from ../../../src/include/port.h:17,
from ../../../src/include/c.h:1114,
from ../../../src/include/postgres_fe.h:25,
from specparse.y:13:
/usr/include/sys/vnode.h:105: error: syntax error before "kstat_named_t"
/usr/include/sys/vnode.h:107: error: syntax error before "nread"
/usr/include/sys/vnode.h:108: error: syntax error before "read_bytes"
/usr/include/sys/vnode.h:109: error: syntax error before "nwrite"
/usr/include/sys/vnode.h:110: error: syntax error before "write_bytes"
/usr/include/sys/vnode.h:111: error: syntax error before "nioctl"
/usr/include/sys/vnode.h:112: error: syntax error before "nsetfl"
/usr/include/sys/vnode.h:113: error: syntax error before "ngetattr"
/usr/include/sys/vnode.h:114: error: syntax error before "nsetattr"
/usr/include/sys/vnode.h:115: error: syntax error before "naccess"
/usr/include/sys/vnode.h:116: error: syntax error before "nlookup"
/usr/include/sys/vnode.h:117: error: syntax error before "ncreate"
/usr/include/sys/vnode.h:118: error: syntax error before "nremove"
/usr/include/sys/vnode.h:119: error: syntax error before "nlink"
/usr/include/sys/vnode.h:120: error: syntax error before "nrename"
/usr/include/sys/vnode.h:121: error: syntax error before "nmkdir"
/usr/include/sys/vnode.h:122: error: syntax error before "nrmdir"
/usr/include/sys/vnode.h:123: error: syntax error before "nreaddir"
/usr/include/sys/vnode.h:124: error: syntax error before "readdir_bytes"
/usr/include/sys/vnode.h:125: error: syntax error before "nsymlink"
/usr/include/sys/vnode.h:126: error: syntax error before "nreadlink"
/usr/include/sys/vnode.h:127: error: syntax error before "nfsync"
/usr/include/sys/vnode.h:128: error: syntax error before "ninactive"
/usr/include/sys/vnode.h:129: error: syntax error before "nfid"
/usr/include/sys/vnode.h:130: error: syntax error before "nrwlock"
/usr/include/sys/vnode.h:131: error: syntax error before "nrwunlock"
/usr/include/sys/vnode.h:132: error: syntax error before "nseek"
/usr/include/sys/vnode.h:133: error: syntax error before "ncmp"
/usr/include/sys/vnode.h:134: error: syntax error before "nfrlock"
/usr/include/sys/vnode.h:135: error: syntax error before "nspace"
/usr/include/sys/vnode.h:136: error: syntax error before "nrealvp"
/usr/include/sys/vnode.h:137: error: syntax error before "ngetpage"
/usr/include/sys/vnode.h:138: error: syntax error before "nputpage"
/usr/include/sys/vnode.h:139: error: syntax error before "nmap"
/usr/include/sys/vnode.h:140: error: syntax error before "naddmap"
/usr/include/sys/vnode.h:141: error: syntax error before "ndelmap"
/usr/include/sys/vnode.h:142: error: syntax error before "npoll"
/usr/include/sys/vnode.h:143: error: syntax error before "ndump"
/usr/include/sys/vnode.h:144: error: syntax error before "npathconf"
/usr/include/sys/vnode.h:145: error: syntax error before "npageio"
/usr/include/sys/vnode.h:146: error: syntax error before "ndumpctl"
/usr/include/sys/vnode.h:147: error: syntax error before "ndispose"
/usr/include/sys/vnode.h:148: error: syntax error before "nsetsecattr"
/usr/include/sys/vnode.h:149: error: syntax error before "ngetsecattr"
/usr/include/sys/vnode.h:150: error: syntax error before "nshrlock"
/usr/include/sys/vnode.h:151: error: syntax error before "nvnevent"
gmake: *** [specparse.o] Error 1

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Andres Freund

andres@anarazel.de

over 10 years ago

In reply to: Andrew Dunstan (#1)

Re: problems on Solaris

On 2015-05-24 19:44:37 -0400, Andrew Dunstan wrote:

Buildfarm members casteroides and protosciurus have been having some
problems that seem puzzling. These animals both run on the same machine, but
with different compilers.

casteroides runs with the Sun Studio 12 compiler, and has twice in the last
3 days demonstrated this error:

[5561ce0c.51b7:25] LOG: starting background worker process "test_shm_mq"
[5561ce1e.5287:9] PANIC: stuck spinlock (100cb77f4) detected at atomics.c:30
[5561ce1e.5287:10] STATEMENT: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
[5561ce0c.51b7:26] LOG: server process (PID 21127) was terminated by signal 6
[5561ce0c.51b7:27] DETAIL: Failed process was running: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
[5561ce0c.51b7:28] LOG: terminating any other active server processes

It's not constant - between the two failures was a success.

That's indeed rather odd. For one the relevant code does nothing but
lock/unlock a spinlock. For another, there's been no recent change to
this and casteroides has been running happily for a long time.

protosciurus runs with gcc 3.4.3 and gets this error:

gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument -g -I/usr/local/include -m64 -I. -I../../../src/interfaces/libpq -I./../regress -I../../../src/include -c -o specparse.o specparse.c
In file included from /usr/include/sys/vnode.h:47,
from /usr/include/sys/stream.h:22,
from /usr/include/netinet/in.h:66,
from /usr/include/netdb.h:98,
from ../../../src/include/port.h:17,
from ../../../src/include/c.h:1114,
from ../../../src/include/postgres_fe.h:25,
from specparse.y:13:
/usr/include/sys/kstat.h:439: error: syntax error before numeric constant
/usr/include/sys/kstat.h:463: error: syntax error before '}' token
/usr/include/sys/kstat.h:464: error: syntax error before '}' token
In file included from /usr/include/sys/stream.h:22,
from /usr/include/netinet/in.h:66,
from /usr/include/netdb.h:98,
from ../../../src/include/port.h:17,
from ../../../src/include/c.h:1114,
from ../../../src/include/postgres_fe.h:25,
from specparse.y:13:
/usr/include/sys/vnode.h:105: error: syntax error before "kstat_named_t"

I'd noticed this one as well. This sounds like an installation problem,
not really ours. Dave, any chance you could look into this, or give
somebody an account to test what's up?

Greetings,

Andres Freund


Andrew Dunstan

andrew@dunslane.net

over 10 years ago

In reply to: Andres Freund (#2)

Re: problems on Solaris

On 05/24/2015 08:07 PM, Andres Freund wrote:

On 2015-05-24 19:44:37 -0400, Andrew Dunstan wrote:

Buildfarm members casteroides and protosciurus have been having some
problems that seem puzzling. These animals both run on the same machine, but
with different compilers.

casteroides runs with the Sun Studio 12 compiler, and has twice in the last
3 days demonstrated this error:

[5561ce0c.51b7:25] LOG: starting background worker process "test_shm_mq"
[5561ce1e.5287:9] PANIC: stuck spinlock (100cb77f4) detected at atomics.c:30
[5561ce1e.5287:10] STATEMENT: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
[5561ce0c.51b7:26] LOG: server process (PID 21127) was terminated by signal 6
[5561ce0c.51b7:27] DETAIL: Failed process was running: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
[5561ce0c.51b7:28] LOG: terminating any other active server processes

It's not constant - between the two failures was a success.

That's indeed rather odd. For one the relevant code does nothing but
lock/unlock a spinlock. For another, there's been no recent change to
this and casteroides has been running happily for a long time.

Yes, but it wasn't running these tests until a few days ago when its
buildfarm software was upgraded.

cheers

andrew


Andres Freund

andres@anarazel.de

over 10 years ago

In reply to: Andrew Dunstan (#3)

Re: problems on Solaris

On 2015-05-24 21:01:54 -0400, Andrew Dunstan wrote:

Yes, but it wasn't running these tests until a few days ago when its
buildfarm software was upgraded.

But barriers are used in other places too...


Stefan Kaltenbrunner

stefan@kaltenbrunner.cc

over 10 years ago

In reply to: Andres Freund (#4)

Re: problems on Solaris

On 05/25/2015 03:17 AM, Andres Freund wrote:

On 2015-05-24 21:01:54 -0400, Andrew Dunstan wrote:

Yes, but it wasn't running these tests until a few days ago when its
buildfarm software was upgraded.

But barriers are used in other places too...

fwiw: spoonbill just failed in the same part of the regression tests
(and it is a Sparc64 box though not running solaris):

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2015-05-24%2023%3A00%3A07

Stefan


Andres Freund

andres@anarazel.de

over 10 years ago

In reply to: Stefan Kaltenbrunner (#5)

Re: problems on Solaris

On 2015-05-25 09:12:35 +0200, Stefan Kaltenbrunner wrote:

On 05/25/2015 03:17 AM, Andres Freund wrote:

On 2015-05-24 21:01:54 -0400, Andrew Dunstan wrote:

Yes, but it wasn't running these tests until a few days ago when its
buildfarm software was upgraded.

But barriers are used in other places too...

fwiw: spoonbill just failed in the same part of the regression tests
(and it is a Sparc64 box though not running solaris):

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2015-05-24%2023%3A00%3A07

With a quite different error though: PANIC: ERRORDATA_STACK_SIZE exceeded

Hm. So we have an *occasional* stack size exceeded failure and an
occasional spinlock error in test_shm_mq. I'm inclined to think that
this is a shm_mq problem, and not a more general locking problem - it
seems likely, but not guaranteed, that that'd have materialized
elsewhere.

Robert: IIRC there were some problems with shm_mq tests being stuck
before, right?

Greetings,

Andres Freund


Dave Page

dpage@pgadmin.org

over 10 years ago

In reply to: Andres Freund (#2)

Re: problems on Solaris

On Mon, May 25, 2015 at 1:07 AM, Andres Freund <andres@anarazel.de> wrote:

On 2015-05-24 19:44:37 -0400, Andrew Dunstan wrote:

Buildfarm members casteroides and protosciurus have been having some
problems that seem puzzling. These animals both run on the same machine, but
with different compilers.

casteroides runs with the Sun Studio 12 compiler, and has twice in the last
3 days demonstrated this error:

[5561ce0c.51b7:25] LOG: starting background worker process "test_shm_mq"
[5561ce1e.5287:9] PANIC: stuck spinlock (100cb77f4) detected at atomics.c:30
[5561ce1e.5287:10] STATEMENT: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
[5561ce0c.51b7:26] LOG: server process (PID 21127) was terminated by signal 6
[5561ce0c.51b7:27] DETAIL: Failed process was running: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
[5561ce0c.51b7:28] LOG: terminating any other active server processes

It's not constant - between the two failures was a success.

That's indeed rather odd. For one the relevant code does nothing but
lock/unlock a spinlock. For another, there's been no recent change to
this and casteroides has been running happily for a long time.

protosciurus runs with gcc 3.4.3 and gets this error:

gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument -g -I/usr/local/include -m64 -I. -I../../../src/interfaces/libpq -I./../regress -I../../../src/include -c -o specparse.o specparse.c
In file included from /usr/include/sys/vnode.h:47,
from /usr/include/sys/stream.h:22,
from /usr/include/netinet/in.h:66,
from /usr/include/netdb.h:98,
from ../../../src/include/port.h:17,
from ../../../src/include/c.h:1114,
from ../../../src/include/postgres_fe.h:25,
from specparse.y:13:
/usr/include/sys/kstat.h:439: error: syntax error before numeric constant
/usr/include/sys/kstat.h:463: error: syntax error before '}' token
/usr/include/sys/kstat.h:464: error: syntax error before '}' token
In file included from /usr/include/sys/stream.h:22,
from /usr/include/netinet/in.h:66,
from /usr/include/netdb.h:98,
from ../../../src/include/port.h:17,
from ../../../src/include/c.h:1114,
from ../../../src/include/postgres_fe.h:25,
from specparse.y:13:
/usr/include/sys/vnode.h:105: error: syntax error before "kstat_named_t"

I'd noticed this one as well. This sounds like an installation problem,
not really ours. Dave, any chance you could look into this, or give
somebody an account to test what's up?

I'm not going to be able to look at this, at least this week. I can
give someone on the EDB team access - Robert; can one of your guys
take a look?

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Andres Freund (#6)

Re: problems on Solaris

On Mon, May 25, 2015 at 10:05 PM, Andres Freund <andres@anarazel.de> wrote:

Hm. So we have an *occasional* stack size exceeded failure and an
occasional spinlock error in test_shm_mq. I'm inclined to think that
this is a shm_mq problem, and not a more general locking problem - it
seems likely, but not guaranteed, that that'd have materialized
elsewhere.

I think the problem might be that the spinlock-based memory barrier is
not re-entrant. Suppose some kind of barrier operation is in process,
and we've acquired the dummy spinlock but not yet released it. Just
then, we receive a signal. Since the shm_mq code sets
set_latch_on_sigusr1, procsignal_sigusr1_handler will set MyLatch.
SetLatch now includes barrier operations, so we'll try to acquire and
release the spinlock despite already holding it. Oops.
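A minimal sketch of the hazard described above, not the PostgreSQL code itself. The fallback barrier is modeled as acquire/release of a dummy lock built on C11 atomic_flag, and the signal handler is simulated by a plain function call made while the lock is held. All names here (dummy_lock, lock_bounded, fallback_barrier_bounded, MAX_SPINS) are hypothetical, and the spin is bounded only so that the demonstration returns instead of hanging.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the dummy spinlock behind the fallback barrier. */
static atomic_flag dummy_lock = ATOMIC_FLAG_INIT;

enum { MAX_SPINS = 1000 };   /* illustration only: give up instead of hanging */

/* Try to take the lock, spinning at most MAX_SPINS times. */
static bool lock_bounded(void)
{
    for (int i = 0; i < MAX_SPINS; i++)
        if (!atomic_flag_test_and_set(&dummy_lock))
            return true;     /* flag was clear: we now hold the lock */
    return false;            /* still held: the "stuck spinlock" case */
}

static void unlock(void)
{
    atomic_flag_clear(&dummy_lock);
}

/* Spinlock-based barrier: acquire and immediately release the dummy lock. */
static bool fallback_barrier_bounded(void)
{
    if (!lock_bounded())
        return false;
    unlock();
    return true;
}

/*
 * Simulates a signal arriving between acquire and release: the "handler"
 * runs the same barrier while the interrupted code still holds the lock.
 * Returns what the handler's barrier call reported.
 */
static bool barrier_interrupted_by_handler(void)
{
    bool acquired = lock_bounded();
    bool handler_ok;

    assert(acquired);                          /* the lock is free on entry */
    handler_ok = fallback_barrier_bounded();   /* "handler" runs here */
    unlock();
    return handler_ok;
}
```

In this sketch fallback_barrier_bounded() succeeds when called on its own, while barrier_interrupted_by_handler() reports failure, because the nested call can never see the lock released by the code it interrupted. With an unbounded spin that nested call would simply never return.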

Robert: IIRC there were some problems with shm_mq tests being stuck
before, right?

The last round of investigation, on anole, resulted in this fix:

commit d0410d66037c2f3f9bee45e0a2db9e47eeba2bb4
Author: Robert Haas <rhaas@postgresql.org>
Date: Sat Oct 4 21:25:41 2014 -0400

Eliminate one background-worker-related flag variable.

Teach sigusr1_handler() to use the same test for whether a worker
might need to be started as ServerLoop(). Aside from being perhaps
a bit simpler, this prevents a potentially-unbounded delay when
starting a background worker. On some platforms, select() doesn't
return when interrupted by a signal, but is instead restarted,
including a reset of the timeout to the originally-requested value.
If signals arrive often enough, but no connection requests arrive,
sigusr1_handler() will be executed repeatedly, but the body of
ServerLoop() won't be reached. This change ensures that, even in
that case, background workers will eventually get launched.

This is far from a perfect fix; really, we need select() to return
control to ServerLoop() after an interrupt, either via the self-pipe
trick or some other mechanism. But that's going to require more
work and discussion, so let's do this for now to at least mitigate
the damage.

Per investigation of test_shm_mq failures on buildfarm member anole.

The problem here isn't really with test_shm_mq; it's with the
postmaster. To really make this work properly, we need to be able to
use latches in the postmaster, and we need to generalize
WaitLatchOrSocket so that it can wait for a latch or any of n sockets.
Then ServerLoop can use that instead of calling select directly. This
will probably look a lot like what you did to get rid of
ImmediateInterruptOK.

But all of that seems unrelated to the current problems.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Andres Freund

andres@anarazel.de

over 10 years ago

In reply to: Robert Haas (#8)

Re: problems on Solaris

On 2015-05-27 15:39:14 -0400, Robert Haas wrote:

On Mon, May 25, 2015 at 10:05 PM, Andres Freund <andres@anarazel.de> wrote:

Hm. So we have an *occasional* stack size exceeded failure and an
occasional spinlock error in test_shm_mq. I'm inclined to think that
this is a shm_mq problem, and not a more general locking problem - it
seems likely, but not guaranteed, that that'd have materialized
elsewhere.

I think the problem might be that the spinlock-based memory barrier is
not re-entrant. Suppose some kind of barrier operation is in process,
and we've acquired the dummy spinlock but not yet released it. Just
then, we receive a signal. Since the shm_mq code sets
set_latch_on_sigusr1, procsignal_sigusr1_handler will set MyLatch.
SetLatch now includes barrier operations, so we'll try to acquire and
release the spinlock despite already holding it. Oops.

Oh wow, that's bad, and could explain a couple of the problems we're
seeing. One possible way to fix is to replace the sequence with if
(!TAS(spin)) S_UNLOCK();. But that'd mean TAS() has to be a barrier,
even if the lock isn't free - which e.g. isn't the case for PowerPC's
implementation :(
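A rough sketch of the try-lock idea, written against C11 atomic_flag rather than the real TAS()/S_UNLOCK() macros; the names dummy_lock, trylock_barrier and trylock_barrier_while_held are hypothetical. C11's atomic_flag_test_and_set uses sequentially consistent ordering on every call, including calls that find the flag already set. A hand-written TAS() need not offer that, which is the concern raised above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the dummy spinlock behind the fallback barrier. */
static atomic_flag dummy_lock = ATOMIC_FLAG_INIT;

/*
 * Try-lock flavour of the fallback barrier: it never spins, so a caller that
 * interrupts a lock holder (for example a signal handler) cannot get stuck.
 * Returns true if this call took and released the lock, false if the lock
 * was already held and the call relied on the test-and-set alone.
 */
static bool trylock_barrier(void)
{
    if (!atomic_flag_test_and_set(&dummy_lock))   /* like !TAS(spin) */
    {
        atomic_flag_clear(&dummy_lock);           /* like S_UNLOCK() */
        return true;
    }
    return false;
}

/* Simulated handler running while the interrupted code holds the lock. */
static bool trylock_barrier_while_held(void)
{
    bool was_set = atomic_flag_test_and_set(&dummy_lock);
    bool nested;

    assert(!was_set);               /* the sketch starts with the lock free */
    nested = trylock_barrier();     /* "handler": returns without spinning */
    atomic_flag_clear(&dummy_lock);
    return nested;
}
```

The nested call returns immediately with false instead of spinning, so the re-entrancy hang goes away. Whether that false path still orders memory accesses depends entirely on the test-and-set primitive, which this sketch assumes and does not prove.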


#10

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Andres Freund (#9)

Re: problems on Solaris

On Wed, May 27, 2015 at 6:55 PM, Andres Freund <andres@anarazel.de> wrote:

On 2015-05-27 15:39:14 -0400, Robert Haas wrote:

On Mon, May 25, 2015 at 10:05 PM, Andres Freund <andres@anarazel.de> wrote:

Hm. So we have an *occasional* stack size exceeded failure and an
occasional spinlock error in test_shm_mq. I'm inclined to think that
this is a shm_mq problem, and not a more general locking problem - it
seems likely, but not guaranteed, that that'd have materialized
elsewhere.

I think the problem might be that the spinlock-based memory barrier is
not re-entrant. Suppose some kind of barrier operation is in process,
and we've acquired the dummy spinlock but not yet released it. Just
then, we receive a signal. Since the shm_mq code sets
set_latch_on_sigusr1, procsignal_sigusr1_handler will set MyLatch.
SetLatch now includes barrier operations, so we'll try to acquire and
release the spinlock despite already holding it. Oops.

Oh wow, that's bad, and could explain a couple of the problems we're
seeing. One possible way to fix is to replace the sequence with if
(!TAS(spin)) S_UNLOCK();. But that'd mean TAS() has to be a barrier,
even if the lock isn't free - which e.g. isn't the case for PowerPC's
implementation :(

Another possibility is to make the fallback barrier implementation a
system call, like maybe kill(PostmasterPid, 0).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#11

Andres Freund

andres@anarazel.de

over 10 years ago

In reply to: Robert Haas (#10)

Re: problems on Solaris

On 2015-05-27 21:23:34 -0400, Robert Haas wrote:

Oh wow, that's bad, and could explain a couple of the problems we're
seeing. One possible way to fix is to replace the sequence with if
(!TAS(spin)) S_UNLOCK();. But that'd mean TAS() has to be a barrier,
even if the lock isn't free - which e.g. isn't the case for PowerPC's
implementation :(

Another possibility is to make the fallback barrier implementation a
system call, like maybe kill(PostmasterPid, 0).

It's not necessarily true that all system calls are effective
barriers. I'm e.g. doubtful that kill(..., 0) is one as it only performs
local error checking. It might be that the process existence check
includes a lock that's sufficient, but I would not like to rely on
it. Sending an actual signal probably would be, but has the potential of
disrupting postmaster progress.

I think we should just bite the bullet and require a barrier
implementation for all architectures that have spinlock support. That
should be a fairly straightforward, even though distinctly unpleasant,
exercise. And then use semaphores (PGSemaphoreUnlock();PGSemaphoreLock()
doesn't have the issue that spinlocks have) for --disable-spinlock
platforms.

If people agree with that way forward, I'll go through the
platforms. The biggest one missing is probably solaris with sun's
compiler.

Greetings,

Andres Freund


#12

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Andres Freund (#11)

Re: problems on Solaris

On Sat, May 30, 2015 at 7:09 PM, Andres Freund <andres@anarazel.de> wrote:

On 2015-05-27 21:23:34 -0400, Robert Haas wrote:

Oh wow, that's bad, and could explain a couple of the problems we're
seeing. One possible way to fix is to replace the sequence with if
(!TAS(spin)) S_UNLOCK();. But that'd mean TAS() has to be a barrier,
even if the lock isn't free - which e.g. isn't the case for PowerPC's
implementation :(

Another possibility is to make the fallback barrier implementation a
system call, like maybe kill(PostmasterPid, 0).

It's not necessarily true that all system calls are effective
barriers. I'm e.g. doubtful that kill(..., 0) is one as it only performs
local error checking. It might be that the process existence check
includes a lock that's sufficient, but I would not like to rely on
it. Sending an actual signal probably would be, but has the potential of
disrupting postmaster progress.

So pick a better system call?

I think we should just bite the bullet and require a barrier
implementation for all architectures that have spinlock support. That
should be a fairly straightforward, even though distinctly unpleasant,
exercise. And then use semaphores (PGSemaphoreUnlock();PGSemaphoreLock()
doesn't have the issue that spinlocks have) for --disable-spinlock
platforms.

Like maybe this.

If people agree with that way forward, I'll go through the
platforms. The biggest one missing is probably solaris with sun's
compiler.

Certainly, having real barriers everywhere would be great.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#13

Andres Freund

andres@anarazel.de

over 10 years ago

In reply to: Robert Haas (#12)

Re: problems on Solaris

On 2015-05-31 08:00:44 -0400, Robert Haas wrote:

On Sat, May 30, 2015 at 7:09 PM, Andres Freund <andres@anarazel.de> wrote:

On 2015-05-27 21:23:34 -0400, Robert Haas wrote:

Oh wow, that's bad, and could explain a couple of the problems we're
seeing. One possible way to fix is to replace the sequence with if
(!TAS(spin)) S_UNLOCK();. But that'd mean TAS() has to be a barrier,
even if the lock isn't free - which e.g. isn't the case for PowerPC's
implementation :(

Another possibility is to make the fallback barrier implementation a
system call, like maybe kill(PostmasterPid, 0).

It's not necessarily true that all system calls are effective
barriers. I'm e.g. doubtful that kill(..., 0) is one as it only performs
local error checking. It might be that the process existence check
includes a lock that's sufficient, but I would not like to rely on
it. Sending an actual signal probably would be, but has the potential of
disrupting postmaster progress.

So pick a better system call?

It's not yet entirely clear what that'd be unfortunately. Maybe we could
use waitpid(PostmasterPid, status, WNOHANG) - afaics that should work.

I think we should just bite the bullet and require a barrier
implementation for all architectures that have spinlock support. That
should be a fairly straightforward, even though distinctly unpleasant,
exercise. And then use semaphores (PGSemaphoreUnlock();PGSemaphoreLock()
doesn't have the issue that spinlocks have) for --disable-spinlock
platforms.

Like maybe this.

On second thought they're unfortunately not entirely suitable. While
we've used semaphores in signal handlers indirectly for a long while
(e.g. deadlock detector, sinval code etc), they're formally not
guaranteed to be signal safe.

Greetings,

Andres Freund


#14

Andres Freund

andres@anarazel.de

over 10 years ago

In reply to: Andres Freund (#11)

Re: problems on Solaris

On 2015-05-31 01:09:18 +0200, Andres Freund wrote:

On 2015-05-27 21:23:34 -0400, Robert Haas wrote:

Oh wow, that's bad, and could explain a couple of the problems we're
seeing. One possible way to fix is to replace the sequence with if
(!TAS(spin)) S_UNLOCK();. But that'd mean TAS() has to be a barrier,
even if the lock isn't free - which e.g. isn't the case for PowerPC's
implementation :(

Another possibility is to make the fallback barrier implementation a
system call, like maybe kill(PostmasterPid, 0).

It's not necessarily true that all system calls are effective
barriers. I'm e.g. doubtful that kill(..., 0) is one as it only performs
local error checking. It might be that the process existence check
includes a lock that's sufficient, but I would not like to rely on
it. Sending an actual signal probably would be, but has the potential of
disrupting postmaster progress.

I thought about various other syscalls we could use, and your proposal
seems to be least worst. My idea of using waitpid() falls short because
it only works for child processes. I think the kind of systems that we
don't have barriers on, are unlikely to use complex stuff like RCU to
manage access to process hierarchies.
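A small sketch of why waitpid() does not fit, assuming a POSIX system. A backend is not the parent of the postmaster, so the call fails with ECHILD before doing any useful work. The helper names waitpid_errno and check_waitpid_on_parent are hypothetical, and the process's own parent stands in for PostmasterPid.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Non-blocking waitpid() on an arbitrary pid. Returns the errno of a failed
 * call, or 0 if the call did not fail.
 */
static int waitpid_errno(pid_t pid)
{
    int status = 0;

    errno = 0;
    return (waitpid(pid, &status, WNOHANG) == -1) ? errno : 0;
}

/* Self-check: our parent is not our child, so POSIX specifies ECHILD. */
static void check_waitpid_on_parent(void)
{
    assert(waitpid_errno(getppid()) == ECHILD);
}
```

Calling check_waitpid_on_parent() from any ordinary process should pass, which matches the limitation noted above: waitpid() only accepts pids of the caller's own children.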

I reproduced the 'stuck' issue on x86 by #ifdef'ing out barrier support
- about 50% of the time test_shm_mq gets stuck. Replacing it with
kill(PostmasterPid, 0) "works". Unless somebody protests soon that's
what I'm going to commit. It surely is better than easily reproducible
hangs.
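For illustration, a sketch of a system-call-based fallback in the spirit of kill(PostmasterPid, 0), not the committed code. The names syscall_barrier, on_sigusr1 and barrier_from_handler are hypothetical, and the process's own pid stands in for PostmasterPid. kill() is on POSIX's list of async-signal-safe functions, so calling it from a handler is permitted. Whether the kernel entry actually orders memory accesses is the platform assumption debated above, and the sketch does not establish it.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical fallback "barrier": signal 0 performs only the existence and
 * permission checks and delivers nothing. Returns kill()'s result.
 */
static int syscall_barrier(pid_t target)
{
    return kill(target, 0);
}

static volatile sig_atomic_t handler_result = -1;

/* Handler that issues the same barrier; there is no lock to be stuck on. */
static void on_sigusr1(int signo)
{
    (void) signo;
    handler_result = syscall_barrier(getpid());
}

/* Installs the handler, raises SIGUSR1, and reports the handler's result. */
static int barrier_from_handler(void)
{
    struct sigaction sa;
    int rc;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);

    raise(SIGUSR1);   /* the handler has run by the time raise() returns */
    return handler_result;
}
```

Both syscall_barrier(getpid()) and barrier_from_handler() return 0 here. The handler holds no state that interrupted code could have left half-updated, which is what the spinlock-based fallback lacks.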

I'm wondering whether we should add a #warning to atomics.c if either the
fallback memory or compiler barrier is used? Might be annoying to people
using -Werror, but I doubt that's possible anyway on such old systems.

Greetings,

Andres Freund


#15

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Andres Freund (#14)

Re: problems on Solaris

On Wed, Jun 24, 2015 at 8:42 AM, Andres Freund <andres@anarazel.de> wrote:

I'm wondering whether we should add a #warning to atomics.c if either the
fallback memory or compiler barrier is used? Might be annoying to people
using -Werror, but I doubt that's possible anyway on such old systems.

#warning isn't totally portable, so I think it might be better not to
do that. Yeah, it'll work in a lot of places, but the sorts of
obscure systems where the fallbacks are used are also more likely to
have funky compilers that just barf on the directive outright.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
