shm_mq_set_sender() crash

Started by Tom Laneover 9 years ago6 messages
#1Tom Lane
tgl@sss.pgh.pa.us

Latest from lorikeet:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2016-08-26%2008%3A37%3A27

TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "/home/andrew/bf64/root/REL9_6_STABLE/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c", Line: 220)

Thoughts?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#1)
Re: shm_mq_set_sender() crash

On Fri, Aug 26, 2016 at 6:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Latest from lorikeet:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&amp;dt=2016-08-26%2008%3A37%3A27

TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "/home/andrew/bf64/root/REL9_6_STABLE/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c", Line: 220)

Do you think, it is due to some recent change or we are just seeing
now as it could be timing specific issue?

So here what seems to be happening is that during worker startup, we
are trying to set the sender for a shared memory queue and the same is
already set. Now, one theoretical possibility of the same could be
that the two workers get the same ParallelWorkerNumber which is then
used to access the shm queue (refer
ParallelWorkerMain/ExecParallelGetReceiver). We are setting the
ParallelWorkerNumber in below code which seems to be doing what it is
suppose to do:

LaunchParallelWorkers()
{
..
for (i = 0; i < pcxt->nworkers; ++i)
{
memcpy(worker.bgw_extra, &i, sizeof(int));
if (!any_registrations_failed &&
RegisterDynamicBackgroundWorker(&worker,
&pcxt->worker[i].bgwhandle))
..
}

Can some reordering impact the above code?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#2)
Re: shm_mq_set_sender() crash

On Sat, Aug 27, 2016 at 3:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Aug 26, 2016 at 6:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Latest from lorikeet:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&amp;dt=2016-08-26%2008%3A37%3A27

TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "/home/andrew/bf64/root/REL9_6_STABLE/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c", Line: 220)

Do you think, it is due to some recent change or we are just seeing
now as it could be timing specific issue?

So here what seems to be happening is that during worker startup, we
are trying to set the sender for a shared memory queue and the same is
already set. Now, one theoretical possibility of the same could be
that the two workers get the same ParallelWorkerNumber which is then
used to access the shm queue (refer
ParallelWorkerMain/ExecParallelGetReceiver). We are setting the
ParallelWorkerNumber in below code which seems to be doing what it is
suppose to do:

LaunchParallelWorkers()
{
..
for (i = 0; i < pcxt->nworkers; ++i)
{
memcpy(worker.bgw_extra, &i, sizeof(int));
if (!any_registrations_failed &&
RegisterDynamicBackgroundWorker(&worker,
&pcxt->worker[i].bgwhandle))
..
}

Can some reordering impact the above code?

I don't think so. Your guess that ParallelWorkerNumber is getting
messed up somehow seems like a good one, but I don't see anything
wrong with that code. There's actually a pretty long chain here.
That code copies the value of the local variable i into
worker.bgw_extra. Then, RegisterDynamicBackgroundWorker copies the
whole structure into shared memory. Then, running inside the
postmaster, BackgroundWorkerStateChange copies it into the postmaster
address space. But, since this is Windows, that copy doesn't actually
passed to the worker; instead, BackgroundWorkerEntry() copies the data
from shared memory into the new worker processes' MyBgworkerEntry.
Then BackgroundWorkerMain() copies the data from there to
ParallelWorkerNumber. In theory any of those places could be going
wrong somehow, though none of them can be completely busted because
they all work at least most of the time.

Of course, it's also possible that the ParallelWorkerNumber code is
entirely correct and something overwrote the null bytes that were
supposed to be found at that location. It would be very useful to see
(a) the value of ParallelWorkerNumber and (b) the contents of
vmq->mq_sender, and in particular whether that's actually a valid
pointer to a PGPROC in the ProcArray. But unless we can reproduce
this I don't see how to manage that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#3)
Re: shm_mq_set_sender() crash

Robert Haas <robertmhaas@gmail.com> writes:

Of course, it's also possible that the ParallelWorkerNumber code is
entirely correct and something overwrote the null bytes that were
supposed to be found at that location. It would be very useful to see
(a) the value of ParallelWorkerNumber and (b) the contents of
vmq->mq_sender, and in particular whether that's actually a valid
pointer to a PGPROC in the ProcArray. But unless we can reproduce
this I don't see how to manage that.

Is it worth replacing that Assert with a test-and-elog that would
print those values?

Given that we've seen only one such instance in the buildfarm, this
might've been just a cosmic ray bit-flip. So one part of me says
not to worry too much until we see it again. OTOH, if it is real
but rare, missing an opportunity to diagnose would be bad.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#4)
Re: shm_mq_set_sender() crash

On Thu, Sep 15, 2016 at 5:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Of course, it's also possible that the ParallelWorkerNumber code is
entirely correct and something overwrote the null bytes that were
supposed to be found at that location. It would be very useful to see
(a) the value of ParallelWorkerNumber and (b) the contents of
vmq->mq_sender, and in particular whether that's actually a valid
pointer to a PGPROC in the ProcArray. But unless we can reproduce
this I don't see how to manage that.

Is it worth replacing that Assert with a test-and-elog that would
print those values?

Given that we've seen only one such instance in the buildfarm, this
might've been just a cosmic ray bit-flip. So one part of me says
not to worry too much until we see it again. OTOH, if it is real
but rare, missing an opportunity to diagnose would be bad.

I wonder if we could persuade somebody to run pgbench on a Windows
machine with a similar environment, at least some concurrency, and
force_parallel_mode=on. Assuming this is a generic
initialize-the-parallel-stuff bug and not something specific to a
particular query, that might well trigger it a lot quicker than
waiting for it to recur in the BF.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#5)
Re: shm_mq_set_sender() crash

On Thu, Sep 15, 2016 at 06:21:30PM -0400, Robert Haas wrote:

On Thu, Sep 15, 2016 at 5:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Of course, it's also possible that the ParallelWorkerNumber code is
entirely correct and something overwrote the null bytes that were
supposed to be found at that location. It would be very useful to see
(a) the value of ParallelWorkerNumber and (b) the contents of
vmq->mq_sender, and in particular whether that's actually a valid
pointer to a PGPROC in the ProcArray. But unless we can reproduce
this I don't see how to manage that.

Is it worth replacing that Assert with a test-and-elog that would
print those values?

Given that we've seen only one such instance in the buildfarm, this
might've been just a cosmic ray bit-flip. So one part of me says
not to worry too much until we see it again. OTOH, if it is real
but rare, missing an opportunity to diagnose would be bad.

I wonder if we could persuade somebody to run pgbench on a Windows
machine with a similar environment, at least some concurrency, and
force_parallel_mode=on. Assuming this is a generic
initialize-the-parallel-stuff bug and not something specific to a
particular query, that might well trigger it a lot quicker than
waiting for it to recur in the BF.

For the sake of this thread in the archives, the cause was almost surely a
Cygwin signal handling bug:

/messages/by-id/20170803034740.GA2641942@rfd.leadboat.com
https://marc.info/?t=150183296400001
https://marc.info/?t=150291861700011

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers