intermittent failures in Cygwin from select_parallel tests

Started by Andrew Dunstan · almost 9 years ago · 54 messages · pgsql-hackers
#1Andrew Dunstan
andrew@dunslane.net

Buildfarm member lorikeet is failing occasionally with a failed
assertion during the select_parallel regression tests like this:

2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG: statement: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c", Line: 221)

I'll see if I can find out why, but if anyone has any ideas why this might be happening (started about 3 weeks ago) that would be helpful.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#1)
Re: intermittent failures in Cygwin from select_parallel tests

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:

Buildfarm member lorikeet is failing occasionally with a failed
assertion during the select_parallel regression tests like this:

2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG: statement: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c", Line: 221)

I'll see if I can find out why, but if anyone has any ideas why this might be happening (started about 3 weeks ago) that would be helpful.

Well, this seems like it has to indicate an incorrect call of
shm_mq_set_sender. I have no great insights as to what might be causing
that, but I sure find it to be pretty unacceptable coding practice that
the call sites are not checking for failure returns from shm_toc_lookup.

regards, tom lane


#3Robert Haas
robertmhaas@gmail.com
In reply to: Andrew Dunstan (#1)
Re: intermittent failures in Cygwin from select_parallel tests

On Mon, Jun 5, 2017 at 10:40 AM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:

Buildfarm member lorikeet is failing occasionally with a failed
assertion during the select_parallel regression tests like this:

2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG: statement: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c", Line: 221)

I'll see if I can find out why, but if anyone has any ideas why this might be happening (started about 3 weeks ago) that would be helpful.

I don't *think* we've made any relevant code changes lately. The only
thing that I can see as looking at all relevant is
b6dd1271281ce856ab774fc0b491a92878e3b501, but that doesn't really seem
like it can be to blame.

One thought is that the only places where shm_mq_set_sender() should
be getting invoked during the main regression tests are
ParallelWorkerMain() and ExecParallelGetReceiver, and both of those
places use ParallelWorkerNumber to figure out what address to pass.
So if ParallelWorkerNumber were getting set to the same value in two
different parallel workers - e.g. because the postmaster went nuts and
launched two processes instead of only one - or if
ParallelWorkerNumber were not getting initialized at all or were
getting initialized to some completely bogus value, it could cause
this symptom.

What ought to be happening, if there are N workers launched by a
parallel query, is that ParallelWorkerNumber should be different in
every worker, over the range 0 to N-1. I think if I were you my first
step would be to verify that ParallelWorkerNumber is in that range in
the crashed backend, and if it is, my second step would be to add some
debugging code to ParallelWorkerMain() to print it out in every worker
that gets launched and make sure they're all in range and different.
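The invariant Robert describes can be pictured as a small standalone check (a hypothetical helper for illustration, not PostgreSQL source): with N workers launched, every worker's ParallelWorkerNumber must fall in [0, N-1] and no two workers may share a number.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical check mirroring the invariant described above:
 * with n workers launched, every ParallelWorkerNumber must be
 * in the range 0..n-1 and no two workers may share a number. */
static bool
worker_numbers_valid(const int *nums, int n)
{
    bool        seen[64];       /* sketch assumes n <= 64 */

    memset(seen, 0, sizeof(seen));
    for (int i = 0; i < n; i++)
    {
        if (nums[i] < 0 || nums[i] >= n)
            return false;       /* out of range */
        if (seen[nums[i]])
            return false;       /* duplicate: two workers, one number */
        seen[nums[i]] = true;
    }
    return true;
}
```

A duplicate number is exactly the "two workers think they are the same worker" scenario that would make both of them call shm_mq_set_sender() on the same queue.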

All of the above might be going in the wrong direction entirely, but
it's the first thing that comes to mind for me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#3)
Re: intermittent failures in Cygwin from select_parallel tests

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Jun 5, 2017 at 10:40 AM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:

Buildfarm member lorikeet is failing occasionally with a failed
assertion during the select_parallel regression tests like this:

I don't *think* we've made any relevant code changes lately. The only
thing that I can see as looking at all relevant is
b6dd1271281ce856ab774fc0b491a92878e3b501, but that doesn't really seem
like it can be to blame.

Yeah, I don't believe that either. That could have introduced a hard
failure (if something were relying on initializing a field before where
I put the memsets) but it's hard to see how it could produce an
intermittent and platform-specific one.

One thought is that the only places where shm_mq_set_sender() should
be getting invoked during the main regression tests are
ParallelWorkerMain() and ExecParallelGetReceiver, and both of those
places use ParallelWorkerNumber to figure out what address to pass.
So if ParallelWorkerNumber were getting set to the same value in two
different parallel workers - e.g. because the postmaster went nuts and
launched two processes instead of only one - or if
ParallelWorkerNumber were not getting initialized at all or were
getting initialized to some completely bogus value, it could cause
this symptom.

Hmm. With some generous assumptions it'd be possible to think that
aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this. That commit was
present in 20 successful lorikeet runs before the first of these failures,
which is a bit more than the MTBF after that, but not a huge amount more.

That commit in itself looks innocent enough, but could it have exposed
some latent bug in bgworker launching?

regards, tom lane


#5Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#4)
Re: intermittent failures in Cygwin from select_parallel tests

On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

One thought is that the only places where shm_mq_set_sender() should
be getting invoked during the main regression tests are
ParallelWorkerMain() and ExecParallelGetReceiver, and both of those
places use ParallelWorkerNumber to figure out what address to pass.
So if ParallelWorkerNumber were getting set to the same value in two
different parallel workers - e.g. because the postmaster went nuts and
launched two processes instead of only one - or if
ParallelWorkerNumber were not getting initialized at all or were
getting initialized to some completely bogus value, it could cause
this symptom.

Hmm. With some generous assumptions it'd be possible to think that
aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this. That commit was
present in 20 successful lorikeet runs before the first of these failures,
which is a bit more than the MTBF after that, but not a huge amount more.

That commit in itself looks innocent enough, but could it have exposed
some latent bug in bgworker launching?

Hmm, that's a really interesting idea, but I can't quite put together
a plausible theory around it. I mean, it seems like that commit could
make launching bgworkers faster, which could conceivably tickle some
heretofore-latent timing-related bug. But it wouldn't, IIUC, make the
first worker start any faster than before - it would just make them
more closely-spaced thereafter, and it's not very obvious how that
would cause a problem.

Another idea is that the commit in question is managing to corrupt
BackgroundWorkerList somehow. maybe_start_bgworkers() is using
slist_foreach_modify(), but previously it always returned after
calling do_start_bgworker, and now it doesn't. So if
do_start_bgworker() did something that could modify the list
structure, then perhaps maybe_start_bgworkers() would get confused. I
don't really think that this theory has any legs, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#5)
Re: intermittent failures in Cygwin from select_parallel tests

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hmm. With some generous assumptions it'd be possible to think that
aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this. That commit was
present in 20 successful lorikeet runs before the first of these failures,
which is a bit more than the MTBF after that, but not a huge amount more.
That commit in itself looks innocent enough, but could it have exposed
some latent bug in bgworker launching?

Hmm, that's a really interesting idea, but I can't quite put together
a plausible theory around it.

Yeah, me either, but we're really theorizing in advance of the data here.
Andrew, could you apply the attached patch on lorikeet and run the
regression tests enough times to get a couple of failures? Then grepping
the postmaster log for 'parallel worker' should give you results like

2017-06-06 16:20:12.393 EDT [31216] LOG: starting PID 31216, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.400 EDT [31216] LOG: stopping PID 31216, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.406 EDT [31217] LOG: starting PID 31217, parallel worker for PID 31215, worker number 3
2017-06-06 16:20:12.406 EDT [31218] LOG: starting PID 31218, parallel worker for PID 31215, worker number 2
2017-06-06 16:20:12.406 EDT [31219] LOG: starting PID 31219, parallel worker for PID 31215, worker number 1
2017-06-06 16:20:12.406 EDT [31220] LOG: starting PID 31220, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.412 EDT [31218] LOG: stopping PID 31218, parallel worker for PID 31215, worker number 2
2017-06-06 16:20:12.412 EDT [31219] LOG: stopping PID 31219, parallel worker for PID 31215, worker number 1
2017-06-06 16:20:12.412 EDT [31220] LOG: stopping PID 31220, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.412 EDT [31217] LOG: stopping PID 31217, parallel worker for PID 31215, worker number 3
... etc etc ...

If it looks different from that in a crash case, we'll have something
to go on.

(I'm tempted to add something like this permanently, at DEBUG1 or DEBUG2
or so.)

regards, tom lane

Attachments:

log-parallel-worker-start-stop.patch (text/x-diff; charset=us-ascii) +12 -0
#7Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#6)
Re: intermittent failures in Cygwin from select_parallel tests

On Tue, Jun 6, 2017 at 4:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

(I'm tempted to add something like this permanently, at DEBUG1 or DEBUG2
or so.)

I don't mind adding it permanently, but I think that's too high.
Somebody running a lot of parallel queries could easily get enough
messages to drown out the stuff they actually want to see at that
level. I wouldn't object to DEBUG3 though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#8Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#5)
Re: intermittent failures in Cygwin from select_parallel tests

On Wed, Jun 7, 2017 at 12:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

One thought is that the only places where shm_mq_set_sender() should
be getting invoked during the main regression tests are
ParallelWorkerMain() and ExecParallelGetReceiver, and both of those
places use ParallelWorkerNumber to figure out what address to pass.
So if ParallelWorkerNumber were getting set to the same value in two
different parallel workers - e.g. because the postmaster went nuts and
launched two processes instead of only one - or if
ParallelWorkerNumber were not getting initialized at all or were
getting initialized to some completely bogus value, it could cause
this symptom.

Hmm. With some generous assumptions it'd be possible to think that
aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this. That commit was
present in 20 successful lorikeet runs before the first of these failures,
which is a bit more than the MTBF after that, but not a huge amount more.

That commit in itself looks innocent enough, but could it have exposed
some latent bug in bgworker launching?

Hmm, that's a really interesting idea, but I can't quite put together
a plausible theory around it. I mean, it seems like that commit could
make launching bgworkers faster, which could conceivably tickle some
heretofore-latent timing-related bug. But it wouldn't, IIUC, make the
first worker start any faster than before - it would just make them
more closely-spaced thereafter, and it's not very obvious how that
would cause a problem.

Another idea is that the commit in question is managing to corrupt
BackgroundWorkerList somehow.

I don't think so, because this problem has been reported previously as
well [1][2], even before the commit in question.

[1]: /messages/by-id/1ce5a19f-3b1d-bb1c-4561-0158176f65f1@dunslane.net
[2]: /messages/by-id/25861.1472215822@sss.pgh.pa.us

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#9Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#8)
Re: intermittent failures in Cygwin from select_parallel tests

On Wed, Jun 7, 2017 at 6:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I don't think so because this problem has been reported previously as
well [1][2] even before the commit in question.

[1] - /messages/by-id/1ce5a19f-3b1d-bb1c-4561-0158176f65f1@dunslane.net
[2] - /messages/by-id/25861.1472215822@sss.pgh.pa.us

Oh, good point. So this is a longstanding bug that has possibly just
gotten easier to hit.

I still think figuring out what's going on with the
ParallelWorkerNumbers is a good plan.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#9)
Re: intermittent failures in Cygwin from select_parallel tests

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Jun 7, 2017 at 6:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I don't think so because this problem has been reported previously as
well [1][2] even before the commit in question.

[1] - /messages/by-id/1ce5a19f-3b1d-bb1c-4561-0158176f65f1@dunslane.net
[2] - /messages/by-id/25861.1472215822@sss.pgh.pa.us

Oh, good point. So this is a longstanding bug that has possibly just
gotten easier to hit.

I still think figuring out what's going on with the
ParallelWorkerNumbers is a good plan.

Also, your suggestion in

/messages/by-id/CA+Tgmob29v0zASBNfgO1Mq9yJ7_TRoAjL=O+2rXS0gBZezv+rQ@mail.gmail.com

for a quicker path to reproducing it might still be valid. Although at
this point, it seems like something we've changed recently has made it
occur often enough in the buildfarm that repeating the standard regression
tests should be good enough.

regards, tom lane


#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#10)
Re: intermittent failures in Cygwin from select_parallel tests

Yesterday lorikeet failed the select_parallel test in a new way:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-13%2020%3A28%3A33

2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR: could not map dynamic shared memory segment
2017-06-13 16:44:57.248 EDT [59404dec.2d9c:5] LOG: worker process: parallel worker for PID 10072 (PID 11896) exited with exit code 1
2017-06-13 16:44:57.273 EDT [59404ec6.2758:64] LOG: statement: select stringu1::int2 from tenk1 where unique1 = 1;
TRAP: FailedAssertion("!(BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count <= 1024)", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c", Line: 974)
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:6] LOG: server process (PID 10072) was terminated by signal 6: Aborted
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:7] DETAIL: Failed process was running: select stringu1::int2 from tenk1 where unique1 = 1;
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:8] LOG: terminating any other active server processes

So the first problem here is the lack of supporting information for the
'could not map' failure. AFAICS, this must mean either that dsm_attach()
returned without ever calling dsm_impl_op() at all, or that
dsm_impl_op()'s Windows codepath encountered ERROR_ALREADY_EXISTS or
ERROR_ACCESS_DENIED. It's far from clear why those cases should be
treated as a silent fail. It's even less clear why dsm_attach's early
exit cases don't produce any messages. But since we're left not knowing
what happened, the messaging design here is clearly inadequate.

The subsequent assertion crash came from here:

/*
* If this is a parallel worker, check whether there are already too many
* parallel workers; if so, don't register another one. Our view of
* parallel_terminate_count may be slightly stale, but that doesn't really
* matter: we would have gotten the same result if we'd arrived here
* slightly earlier anyway. There's no help for it, either, since the
* postmaster must not take locks; a memory barrier wouldn't guarantee
* anything useful.
*/
if (parallel && (BackgroundWorkerData->parallel_register_count -
BackgroundWorkerData->parallel_terminate_count) >=
max_parallel_workers)
{
Assert(BackgroundWorkerData->parallel_register_count -
BackgroundWorkerData->parallel_terminate_count <=
MAX_PARALLEL_WORKER_LIMIT);
LWLockRelease(BackgroundWorkerLock);
return false;
}

What I suspect happened is that parallel_register_count was less than
parallel_terminate_count; since those counters are declared as uint32,
the negative difference would have been treated as a large unsigned value,
triggering the assertion. It's not very clear how that happened, but my
bet is that the postmaster incremented parallel_terminate_count more than
once while cleaning up after the crashed worker. It looks to me like
there's nothing stopping BackgroundWorkerStateChange from incrementing it
and then the eventual ForgetBackgroundWorker call from incrementing it
again. I haven't traced through things to identify why this might only
occur in a worker-failure scenario, but surely we want to make sure that
the counter increment happens once and only once per worker.
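The unsigned wraparound Tom is describing can be demonstrated in isolation (a standalone sketch of the arithmetic, not the bgworker.c code itself): if parallel_terminate_count ever overtakes parallel_register_count, the uint32 subtraction wraps to an enormous value, which both passes the `>= max_parallel_workers` test and trips the `<= MAX_PARALLEL_WORKER_LIMIT` assertion.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PARALLEL_WORKER_LIMIT 1024  /* limit used in the assertion */

/* The difference as computed in the quoted code: both counters are
 * uint32, so a "negative" result wraps to a huge unsigned value. */
static uint32_t
active_parallel_workers(uint32_t register_count, uint32_t terminate_count)
{
    return register_count - terminate_count;
}
```

A single double-increment of the terminate counter is enough to push the difference past UINT32_MAX / 2, so the assertion is a reliable tripwire for exactly this bug.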

I'm also noting that ForgetBackgroundWorker is lacking a memory barrier
between incrementing parallel_terminate_count and marking the slot free.
Maybe it doesn't need one, but since there is one in
BackgroundWorkerStateChange, it looks weird to not have it here.

regards, tom lane


#12Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#11)
Re: intermittent failures in Cygwin from select_parallel tests

On Wed, Jun 14, 2017 at 3:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

So the first problem here is the lack of supporting information for the
'could not map' failure.

Hmm. I think I believed at the time I wrote dsm_attach() that
somebody might want to try to soldier on after failing to map a DSM,
but that doesn't seem very likely any more. That theory is supported
by this comment:

/*
* If we didn't find the handle we're looking for in the control segment,
* it probably means that everyone else who had it mapped, including the
* original creator, died before we got to this point. It's up to the
* caller to decide what to do about that.
*/

But in practice, every current caller handles a failure here by bailing out.

AFAICS, this must mean either that dsm_attach()
returned without ever calling dsm_impl_op() at all, or that
dsm_impl_op()'s Windows codepath encountered ERROR_ALREADY_EXISTS or
ERROR_ACCESS_DENIED. It's far from clear why those cases should be
treated as a silent fail.

There's a good reason for that, though. See
419113dfdc4c729f6c763cc30a9b02ee68a7da94.

It's even less clear why dsm_attach's early
exit cases don't produce any messages. But since we're left not knowing
what happened, the messaging design here is clearly inadequate.

I don't know what you mean by this. The function only has one
early-exit case, the comment for which I quoted above.

The subsequent assertion crash came from here:

/*
* If this is a parallel worker, check whether there are already too many
* parallel workers; if so, don't register another one. Our view of
* parallel_terminate_count may be slightly stale, but that doesn't really
* matter: we would have gotten the same result if we'd arrived here
* slightly earlier anyway. There's no help for it, either, since the
* postmaster must not take locks; a memory barrier wouldn't guarantee
* anything useful.
*/
if (parallel && (BackgroundWorkerData->parallel_register_count -
BackgroundWorkerData->parallel_terminate_count) >=
max_parallel_workers)
{
Assert(BackgroundWorkerData->parallel_register_count -
BackgroundWorkerData->parallel_terminate_count <=
MAX_PARALLEL_WORKER_LIMIT);
LWLockRelease(BackgroundWorkerLock);
return false;
}

What I suspect happened is that parallel_register_count was less than
parallel_terminate_count; since those counters are declared as uint32,
the negative difference would have been treated as a large unsigned value,
triggering the assertion.

Right, and that's exactly the point of having that assertion.

It's not very clear how that happened, but my
bet is that the postmaster incremented parallel_terminate_count more than
once while cleaning up after the crashed worker. It looks to me like
there's nothing stopping BackgroundWorkerStateChange from incrementing it
and then the eventual ForgetBackgroundWorker call from incrementing it
again. I haven't traced through things to identify why this might only
occur in a worker-failure scenario, but surely we want to make sure that
the counter increment happens once and only once per worker.

Yeah -- if that can happen, it's definitely a bug.

I'm also noting that ForgetBackgroundWorker is lacking a memory barrier
between incrementing parallel_terminate_count and marking the slot free.
Maybe it doesn't need one, but since there is one in
BackgroundWorkerStateChange, it looks weird to not have it here.

I noticed that the other day and thought about mentioning it, but
didn't quite muster up the energy. It seems unlikely to me to be the
cause of any of the problems we're seeing, but I think it can't hurt
to fix the omission.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#12)
Re: intermittent failures in Cygwin from select_parallel tests

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Jun 14, 2017 at 3:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

So the first problem here is the lack of supporting information for the
'could not map' failure.

Hmm. I think I believed at the time I wrote dsm_attach() that
somebody might want to try to soldier on after failing to map a DSM,
but that doesn't seem very likely any more.

Well, if they do, they shouldn't be passing elevel == ERROR.

AFAICS, this must mean either that dsm_attach()
returned without ever calling dsm_impl_op() at all, or that
dsm_impl_op()'s Windows codepath encountered ERROR_ALREADY_EXISTS or
ERROR_ACCESS_DENIED. It's far from clear why those cases should be
treated as a silent fail.

There's a good reason for that, though. See
419113dfdc4c729f6c763cc30a9b02ee68a7da94.

But surely the silent treatment should only apply to DSM_OP_CREATE?
We're not going to retry anything else.

It's even less clear why dsm_attach's early
exit cases don't produce any messages. But since we're left not knowing
what happened, the messaging design here is clearly inadequate.

I don't know what you mean by this. The function only has one
early-exit case, the comment for which I quoted above.

OK, s/cases/case/, but the problem remains. We don't know what happened.
We cannot have more than one case in this code where nothing gets logged.

It's not very clear how that happened, but my
bet is that the postmaster incremented parallel_terminate_count more than
once while cleaning up after the crashed worker. It looks to me like
there's nothing stopping BackgroundWorkerStateChange from incrementing it
and then the eventual ForgetBackgroundWorker call from incrementing it
again. I haven't traced through things to identify why this might only
occur in a worker-failure scenario, but surely we want to make sure that
the counter increment happens once and only once per worker.

Yeah -- if that can happen, it's definitely a bug.

My first thought about fixing it is that we should remove that code from
BackgroundWorkerStateChange altogether. The parallel_terminate_count
increment should happen in, and only in, ForgetBackgroundWorker. There
seems little reason to risk bugs by trying to do it a bit earlier.
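One way to picture the once-and-only-once guarantee being proposed (a schematic sketch with invented field names, not the actual patch): guard the counter bump with a per-slot flag, so that even if two cleanup paths both reach the accounting code, only the first one counts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Schematic worker slot; field names are invented for illustration. */
typedef struct
{
    bool        is_parallel;
    bool        terminate_counted;  /* has this termination been counted? */
} WorkerSlotSketch;

static uint32_t parallel_terminate_count = 0;

/* Count a parallel worker's termination at most once, regardless of
 * how many cleanup paths (state change, Forget...) reach this point. */
static void
count_terminated_once(WorkerSlotSketch *slot)
{
    if (slot->is_parallel && !slot->terminate_counted)
    {
        parallel_terminate_count++;
        slot->terminate_counted = true;
    }
}
```

Tom's actual proposal is simpler still: have only one call site (ForgetBackgroundWorker) touch the counter, which makes a guard flag unnecessary.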

regards, tom lane


#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#13)
Re: intermittent failures in Cygwin from select_parallel tests

I wrote:

But surely the silent treatment should only apply to DSM_OP_CREATE?

Oh ... scratch that, it *does* only apply to DSM_OP_CREATE.

The lack of any other message before the 'could not map' failure must,
then, mean that dsm_attach() couldn't find an entry in shared memory
that it wanted to attach to. But how could that happen?

regards, tom lane


#15Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#14)
Re: intermittent failures in Cygwin from select_parallel tests

On Thu, Jun 15, 2017 at 3:31 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

But surely the silent treatment should only apply to DSM_OP_CREATE?

Oh ... scratch that, it *does* only apply to DSM_OP_CREATE.

The lack of any other message before the 'could not map' failure must,
then, mean that dsm_attach() couldn't find an entry in shared memory
that it wanted to attach to.

Yes, I also think so.

But how could that happen?

I can think of a few reasons why it could happen: (a) the value of
the segment handle passed by the master backend to the worker backend
somehow got messed up; (b) all the other workers, along with the master
backend, exited before one of the workers tried to attach; (c) the
master backend never actually created a dsm segment (for example,
because the maximum number of segments had been reached) but still
launched some workers; (d) some corner-case bug in the dsm code that
prevents attaching to a valid segment handle.

Now, of these, I have checked that (c) can't happen, because we ensure
that if the segment is not created then the number of workers is set to
zero. I think (b) shouldn't happen because we wait for all workers to
exit before the query is finished. (a) and (d) are somewhat related;
I have looked around in the relevant code and didn't find any obvious
problem, but it could be something that behaves differently in the
Cygwin environment. For example, the value of seg->handle could differ
from what we expect in dsm_create: random() returns long, and we store
the result in dsm_handle (uint32), so given that long is 8 bytes on
Cygwin [1] and 4 bytes on Windows, the value could wrap. But even if
that happens, it is not clear how it could cause what we are seeing in
this case.

[1]: https://stackoverflow.com/questions/384502/what-is-the-bit-size-of-long-on-64-bit-windows
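The truncation Amit mentions is easy to see in isolation (a standalone sketch; the typedef name is invented, though PostgreSQL's dsm_handle really is a 32-bit type): assigning an 8-byte value into a uint32 keeps only the low 32 bits, so two different long values can map to the same handle.

```c
#include <assert.h>
#include <stdint.h>

/* dsm_handle in PostgreSQL is a uint32; on Cygwin, long is 8 bytes,
 * so storing a long result truncates to the low 32 bits. The int64_t
 * parameter here stands in for Cygwin's 8-byte long. */
typedef uint32_t dsm_handle_sketch;

static dsm_handle_sketch
handle_from_random(int64_t r)
{
    return (dsm_handle_sketch) r;   /* high 32 bits are discarded */
}
```

The collision case below is the interesting one: two distinct 64-bit values yielding the same 32-bit handle, though as noted above it is unclear how that alone would produce the observed failure.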

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#16Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#14)
Re: intermittent failures in Cygwin from select_parallel tests

On Wed, Jun 14, 2017 at 6:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

But surely the silent treatment should only apply to DSM_OP_CREATE?

Oh ... scratch that, it *does* only apply to DSM_OP_CREATE.

The lack of any other message before the 'could not map' failure must,
then, mean that dsm_attach() couldn't find an entry in shared memory
that it wanted to attach to. But how could that happen?

Well, as Amit points out, there are entirely legitimate ways for that
to happen. If the leader finishes the whole query itself before the
worker reaches the dsm_attach() call, it will call dsm_detach(),
destroying the segment, and the worker will hit this ERROR. That
shouldn't happen very often in the real world, because we ought not to
select a parallel plan in the first place unless the query is going to
take a while to run, but the select_parallel test quite deliberately
disarms all of the guards that would tend to discourage such plans.

Of course, as Amit also points out, it could also be the result of
some bug, but I'm not sure we have any reason to think so.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#16)
Re: intermittent failures in Cygwin from select_parallel tests

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Jun 14, 2017 at 6:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The lack of any other message before the 'could not map' failure must,
then, mean that dsm_attach() couldn't find an entry in shared memory
that it wanted to attach to. But how could that happen?

Well, as Amit points out, there are entirely legitimate ways for that
to happen. If the leader finishes the whole query itself before the
worker reaches the dsm_attach() call, it will call dsm_detach(),
destroying the segment, and the worker will hit this ERROR. That
shouldn't happen very often in the real world, because we ought not to
select a parallel plan in the first place unless the query is going to
take a while to run, but the select_parallel test quite deliberately
disarms all of the guards that would tend to discourage such plans.

But we know, from the subsequent failed assertion, that the leader was
still trying to launch parallel workers. So that particular theory
doesn't hold water.

Of course, as Amit also points out, it could also be the result of
some bug, but I'm not sure we have any reason to think so.

The fact that we've only seen this on cygwin leads the mind in the
direction of platform-specific problems. Both this case and lorikeet's
earlier symptoms could be explained if the parameters passed from leader
to workers somehow got corrupted occasionally; so that's what I've been
thinking about, but I'm not seeing anything.

Would someone confirm my recollection that the cygwin build does *not*
use EXEC_BACKEND, but relies on a cygwin-provided emulation of fork()?
How does that emulation work, anyway?

regards, tom lane

#18Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#17)
Re: intermittent failures in Cygwin from select_parallel tests

On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, as Amit points out, there are entirely legitimate ways for that
to happen. If the leader finishes the whole query itself before the
worker reaches the dsm_attach() call, it will call dsm_detach(),
destroying the segment, and the worker will hit this ERROR. That
shouldn't happen very often in the real world, because we ought not to
select a parallel plan in the first place unless the query is going to
take a while to run, but the select_parallel test quite deliberately
disarms all of the guards that would tend to discourage such plans.

But we know, from the subsequent failed assertion, that the leader was
still trying to launch parallel workers. So that particular theory
doesn't hold water.

Is there any chance that it's already trying to launch parallel
workers for the *next* query?

Of course, as Amit also points out, it could also be the result of
some bug, but I'm not sure we have any reason to think so.

The fact that we've only seen this on cygwin leads the mind in the
direction of platform-specific problems. Both this case and lorikeet's
earlier symptoms could be explained if the parameters passed from leader
to workers somehow got corrupted occasionally; so that's what I've been
thinking about, but I'm not seeing anything.

Could be -- but it could also be timing-related. If we are in fact
using cygwin's fork emulation, the documentation for it explains that
it's slow: https://www.cygwin.com/faq.html#faq.api.fork

Interestingly, it also mentions that making it work requires
suspending the parent while the child is starting up, which probably
does not happen on any other platform. Of course it also makes my
theory that the child doesn't reach dsm_attach() before the parent
finishes the query pretty unlikely.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#18)
Re: intermittent failures in Cygwin from select_parallel tests

On Thu, Jun 15, 2017 at 7:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, as Amit points out, there are entirely legitimate ways for that
to happen. If the leader finishes the whole query itself before the
worker reaches the dsm_attach() call, it will call dsm_detach(),
destroying the segment, and the worker will hit this ERROR. That
shouldn't happen very often in the real world, because we ought not to
select a parallel plan in the first place unless the query is going to
take a while to run, but the select_parallel test quite deliberately
disarms all of the guards that would tend to discourage such plans.

But we know, from the subsequent failed assertion, that the leader was
still trying to launch parallel workers. So that particular theory
doesn't hold water.

Is there any chance that it's already trying to launch parallel
workers for the *next* query?

Yes, I think it is for the next query. If you refer to the log below from lorikeet:

2017-06-13 16:44:57.179 EDT [59404ec6.2758:63] LOG: statement:
EXPLAIN (analyze, timing off, summary off, costs off) SELECT * FROM
tenk1;
2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR: could not map
dynamic shared memory segment
2017-06-13 16:44:57.248 EDT [59404dec.2d9c:5] LOG: worker process:
parallel worker for PID 10072 (PID 11896) exited with exit code 1
2017-06-13 16:44:57.273 EDT [59404ec6.2758:64] LOG: statement: select
stringu1::int2 from tenk1 where unique1 = 1;
TRAP: FailedAssertion("!(BackgroundWorkerData->parallel_register_count
- BackgroundWorkerData->parallel_terminate_count <= 1024)", File:
"/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c",
Line: 974)
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:6] LOG: server process
(PID 10072) was terminated by signal 6: Aborted
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:7] DETAIL: Failed process
was running: select stringu1::int2 from tenk1 where unique1 = 1;
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:8] LOG: terminating any
other active server processes

The error "could not map dynamic shared memory segment" comes from the
query "EXPLAIN .. SELECT * FROM tenk1", and the assertion failure comes
from the next statement, "select stringu1::int2 from tenk1 where
unique1 = 1;".
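
A hypothetical model of the accounting behind that TRAP (the class and method names here are illustrative, not the actual PostgreSQL symbols): the assertion in bgworker.c checks that registered-but-not-yet-terminated parallel workers never exceed the slot ceiling, so registrations that outpace termination bookkeeping eventually trip it.

```python
# Toy model of the invariant from the TRAP above: the number of
# registered parallel workers minus terminated ones must stay within
# the ceiling (1024 in the assertion). Names are illustrative only.
MAX_PARALLEL_SLOTS = 1024

class BgWorkerData:
    def __init__(self):
        self.parallel_register_count = 0
        self.parallel_terminate_count = 0

    def register(self):
        self.parallel_register_count += 1
        # The sanity check corresponding to the failed assertion:
        assert (self.parallel_register_count
                - self.parallel_terminate_count) <= MAX_PARALLEL_SLOTS

    def terminate(self):
        self.parallel_terminate_count += 1

data = BgWorkerData()
for _ in range(MAX_PARALLEL_SLOTS):
    data.register()  # fills every slot with no matching terminate

# One more registration without an intervening terminate now trips
# the invariant, as the leader's next statement did in the log.
try:
    data.register()
    print("ok")
except AssertionError:
    print("TRAP: FailedAssertion")
```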

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#18)
Re: intermittent failures in Cygwin from select_parallel tests

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

But we know, from the subsequent failed assertion, that the leader was
still trying to launch parallel workers. So that particular theory
doesn't hold water.

Is there any chance that it's already trying to launch parallel
workers for the *next* query?

Oh! Yeah, you might be right, because the trace includes a statement
LOG entry from the leader in between:

2017-06-13 16:44:57.179 EDT [59404ec6.2758:63] LOG: statement: EXPLAIN (analyze, timing off, summary off, costs off) SELECT * FROM tenk1;
2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR: could not map dynamic shared memory segment
2017-06-13 16:44:57.248 EDT [59404dec.2d9c:5] LOG: worker process: parallel worker for PID 10072 (PID 11896) exited with exit code 1
2017-06-13 16:44:57.273 EDT [59404ec6.2758:64] LOG: statement: select stringu1::int2 from tenk1 where unique1 = 1;
TRAP: FailedAssertion("!(BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count <= 1024)", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c", Line: 974)
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:6] LOG: server process (PID 10072) was terminated by signal 6: Aborted

It's fairly hard to read this other than as telling us that the worker was
launched for the EXPLAIN (although really? why aren't we skipping that if
EXEC_FLAG_EXPLAIN_ONLY?), and then we see a new LOG entry for the next
statement before the leader hits its assertion failure.

Could be -- but it could also be timing-related. If we are in fact
using cygwin's fork emulation, the documentation for it explains that
it's slow: https://www.cygwin.com/faq.html#faq.api.fork
Interestingly, it also mentions that making it work requires
suspending the parent while the child is starting up, which probably
does not happen on any other platform. Of course it also makes my
theory that the child doesn't reach dsm_attach() before the parent
finishes the query pretty unlikely.

Well, if this was a worker launched during InitPlan() for an EXPLAIN,
the leader would have shut down the query almost immediately after
launching the worker. So it does fit pretty well as long as you're
willing to believe that the leader got to run before the child.

But what this theory doesn't explain is: why haven't we seen this before?
It now seems like it ought to come up often, since there are several
EXPLAINs for parallel queries in that test.

regards, tom lane

(Messages #21 through #54 continue the thread with further replies among Robert Haas, Tom Lane, Andrew Dunstan, Amit Kapila, Noah Misch, and Thomas Munro.)