parallel.c oblivion of worker-startup failures

Started by Amit Kapila · over 8 years ago · 48 messages · pgsql-hackers
#1Amit Kapila
amit.kapila16@gmail.com

Some time back, Tom Lane reported [1] about $Subject. I have looked
into the issue and found that the problem is not only with parallel
workers but with general background worker machinery as well in
situations where fork or some such failure occurs. The first problem
is that after we register the dynamic worker, the way to know whether
the worker has started
(WaitForBackgroundWorkerStartup/GetBackgroundWorkerPid) won't give the
right answer if a fork failure happens. Also, in cases where the
worker is marked not to be restarted after a crash, the postmaster
doesn't notify the backend if it is not able to start the worker,
which can make the backend wait forever, as it is oblivious to the
fact that the worker has not started. Now, apart from these general
problems of
background worker machinery, parallel.c assumes that after it has
registered the dynamic workers, they will start and perform their
work. We need to ensure that if the postmaster is not able to start
parallel workers due to a fork failure or any similar situation, the
backend doesn't keep waiting forever. To fix this, before waiting
for workers to finish, we can check whether the workers exist at all.
Attached patch fixes these problems.
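
The hazard described above can be sketched in miniature. This is a
standalone, illustrative simulation, not the attached patch; the enum
values merely mirror the names of PostgreSQL's BgwHandleStatus, and
worker_can_still_arrive is an invented helper:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified mirror of PostgreSQL's BgwHandleStatus (illustrative only). */
typedef enum
{
	BGWH_STARTED,			/* postmaster has started the worker */
	BGWH_NOT_YET_STARTED,	/* postmaster has not yet attempted to start it */
	BGWH_STOPPED			/* worker exited, or the fork failed */
} BgwHandleStatus;

/*
 * Before blocking, the leader should ask whether the worker can still
 * show up at all.  If the status is BGWH_STOPPED and the worker never
 * ran, waiting for it would hang forever.
 */
static bool
worker_can_still_arrive(BgwHandleStatus status, bool worker_ever_ran)
{
	if (status == BGWH_STOPPED && !worker_ever_ran)
		return false;			/* fork failure: stop waiting */
	return true;
}
```

The essential point is the second argument: a fork failure is only
distinguishable from a normal exit by knowing whether the worker ever ran.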

Another way to fix the parallel query related problem is that after
registering the workers, the master backend should wait for workers to
start before setting up different queues for them to communicate. I
think that can be quite expensive.

Thoughts?

[1]: /messages/by-id/4905.1492813727@sss.pgh.pa.us

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_worker_startup_failures_v1.patch (application/octet-stream, +22/-3)
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#1)
Re: parallel.c oblivion of worker-startup failures

Amit Kapila <amit.kapila16@gmail.com> writes:

Attached patch fixes these problems.

Hmm, this patch adds a kill(notify_pid) after one call to
ForgetBackgroundWorker, but the postmaster has several more such calls.
Shouldn't they all notify the notify_pid? Should we move that
functionality into ForgetBackgroundWorker itself, so we can't forget
it again?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#2)
Re: parallel.c oblivion of worker-startup failures

On Mon, Sep 18, 2017 at 10:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

Attached patch fixes these problems.

Hmm, this patch adds a kill(notify_pid) after one call to
ForgetBackgroundWorker, but the postmaster has several more such calls.
Shouldn't they all notify the notify_pid? Should we move that
functionality into ForgetBackgroundWorker itself, so we can't forget
it again?

Among other places, we already notify during
ReportBackgroundWorkerExit(). However, it seems to me that all other
places, except where this patch has added a call to notify, don't need
such a call. The other places, like DetermineSleepTime and
ResetBackgroundWorkerCrashTimes, deal with a crashed worker, which I
don't think requires notifying the backend, as that backend itself
would have restarted. The other place where we call
ForgetBackgroundWorker is in maybe_start_bgworkers, when rw_terminate
is set to true, which again seems to be either the case of a worker
crash or of someone explicitly asking to terminate the worker, in
which case we already send a notification.

I think we need to notify the backend on start, stop, and failure to
start a worker. OTOH, if it is harmless to send a notification even
after the worker has crashed, then we can just move that functionality
into ForgetBackgroundWorker itself, as that will simplify the code;
in fact, that is the first thing that occurred to me as well, but I
haven't done it that way as I was not sure whether we want to send a
notification in all kinds of cases.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#4Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#3)
Re: parallel.c oblivion of worker-startup failures

On Tue, Sep 19, 2017 at 8:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Sep 18, 2017 at 10:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

Attached patch fixes these problems.

Hmm, this patch adds a kill(notify_pid) after one call to
ForgetBackgroundWorker, but the postmaster has several more such calls.
Shouldn't they all notify the notify_pid? Should we move that
functionality into ForgetBackgroundWorker itself, so we can't forget
it again?

Among other places, we already notify during
ReportBackgroundWorkerExit(). However, it seems to me that all other
places, except where this patch has added a call to notify, don't need
such a call. The other places, like DetermineSleepTime and
ResetBackgroundWorkerCrashTimes, deal with a crashed worker, which I
don't think requires notifying the backend, as that backend itself
would have restarted. The other place where we call
ForgetBackgroundWorker is in maybe_start_bgworkers, when rw_terminate
is set to true, which again seems to be either the case of a worker
crash or of someone explicitly asking to terminate the worker, in
which case we already send a notification.

I think we need to notify the backend on start, stop, and failure to
start a worker. OTOH, if it is harmless to send a notification even
after the worker has crashed, then we can just move that functionality
into ForgetBackgroundWorker itself, as that will simplify the code;
in fact, that is the first thing that occurred to me as well, but I
haven't done it that way as I was not sure whether we want to send a
notification in all kinds of cases.

The patch still applies (with some hunks). I have added it in CF [1]
to avoid losing track.

[1]: https://commitfest.postgresql.org/15/1341/

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#5Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#4)
Re: [HACKERS] parallel.c oblivion of worker-startup failures

On Fri, Oct 27, 2017 at 9:54 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The patch still applies (with some hunks). I have added it in CF [1]
to avoid losing track.

[1] - https://commitfest.postgresql.org/15/1341/

This did not get reviews and the patch still applies. I am moving it to next CF.
--
Michael

#6Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#5)
Re: [HACKERS] parallel.c oblivion of worker-startup failures

On Wed, Nov 29, 2017 at 12:11 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Oct 27, 2017 at 9:54 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The patch still applies (with some hunks). I have added it in CF [1]
to avoid losing track.

[1] - https://commitfest.postgresql.org/15/1341/

This did not get reviews and the patch still applies. I am moving it to next CF.

I'd like to break this into two patches, one to fix the general
background worker problems and another to fix the problems specific to
parallel query.

The attached version extracts the two fixes for general background
worker problems from Amit's patch and adjusts the comments a bit more
extensively than he did. Barring objections, I'll commit and
back-patch this; then we can deal with the other part of this
afterward.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

bgworker-startup.patch (application/octet-stream, +24/-8)
#7Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#6)
Re: [HACKERS] parallel.c oblivion of worker-startup failures

On Tue, Dec 5, 2017 at 11:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 29, 2017 at 12:11 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Oct 27, 2017 at 9:54 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The patch still applies (with some hunks). I have added it in CF [1]
to avoid losing track.

[1] - https://commitfest.postgresql.org/15/1341/

This did not get reviews and the patch still applies. I am moving it to next CF.

I'd like to break this into two patches, one to fix the general
background worker problems and another to fix the problems specific to
parallel query.

The attached version extracts the two fixes for general background
worker problems from Amit's patch and adjusts the comments a bit more
extensively than he did.

Looks good to me.

Barring objections, I'll commit and
back-patch this; then we can deal with the other part of this
afterward.

Sure, I will rebase on top of the commit unless you have some comments
on the remaining part.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#8Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#7)
Re: [HACKERS] parallel.c oblivion of worker-startup failures

On Wed, Dec 6, 2017 at 1:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Looks good to me.

Committed and back-patched to 9.4 where dynamic background workers
were introduced.

Barring objections, I'll commit and
back-patch this; then we can deal with the other part of this
afterward.

Sure, I will rebase on top of the commit unless you have some comments
on the remaining part.

I'm not in love with that part of the fix; the other parts of that if
statement are just testing variables, and a function call that takes
and releases an LWLock is a lot more expensive. Furthermore, that
test will often be hit in practice, because we'll often arrive at this
function before workers have actually finished. On top of that, we'll
typically arrive here having already communicated with the worker in
some way, such as by receiving tuples, which means that in most cases
we already know that the worker was alive at least at some point, and
therefore the extra test isn't necessary. We only need that test, if
I understand correctly, to cover the failure-to-launch case, which is
presumably very rare.

All that having been said, I don't quite know how to do it better. It
would be easy enough to modify this function so that if it ever
received a result other than BGWH_NOT_YET_STARTED for a worker, it
wouldn't call this function again for that worker. However, that's
only mitigating the problem, not really fixing it. We'll still end up
frequently inquiring - admittedly only once - about the status of a
worker with which we've previously communicated via the tuple queue.
Maybe we just have to live with that problem, but do you (or does
anyone else) have an idea?
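
Robert's mitigation idea can be sketched as a per-worker cache: once an
expensive status probe has returned anything other than
BGWH_NOT_YET_STARTED, remember that and never probe again. This is a
standalone, illustrative sketch; probe_started, check_started, and
WorkerSlot are invented names, and the function pointer stands in for
the real GetBackgroundWorkerPid, which takes and releases an LWLock:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { BGWH_STARTED, BGWH_NOT_YET_STARTED, BGWH_STOPPED } BgwHandleStatus;

typedef struct WorkerSlot
{
	bool		known_started;	/* cached: probe once returned != NOT_YET_STARTED */
	int			probes;			/* how many expensive probes we made */
} WorkerSlot;

static BgwHandleStatus
probe_started(void)
{
	return BGWH_STARTED;		/* pretend the postmaster forked successfully */
}

/*
 * Probe the worker's status only while it is still unknown whether the
 * postmaster ever started it; afterwards rely on the cached flag.
 */
static bool
check_started(WorkerSlot *slot, BgwHandleStatus (*probe) (void))
{
	if (slot->known_started)
		return true;			/* no LWLock-taking call needed */
	slot->probes++;
	if (probe() != BGWH_NOT_YET_STARTED)
		slot->known_started = true;
	return slot->known_started;
}

static int
demo_probe_count(void)
{
	WorkerSlot	slot = {false, 0};

	/* three wait-loop iterations, but only the first one probes */
	check_started(&slot, probe_started);
	check_started(&slot, probe_started);
	check_started(&slot, probe_started);
	return slot.probes;
}
```

As the message notes, this only mitigates the problem: every worker
still gets probed once even if the leader has already received tuples
from it.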

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#8)
Re: [HACKERS] parallel.c oblivion of worker-startup failures

Robert Haas <robertmhaas@gmail.com> writes:

I'm not in love with that part of the fix; the other parts of that if
statement are just testing variables, and a function call that takes
and releases an LWLock is a lot more expensive. Furthermore, that
test will often be hit in practice, because we'll often arrive at this
function before workers have actually finished. On top of that, we'll
typically arrive here having already communicated with the worker in
some way, such as by receiving tuples, which means that in most cases
we already know that the worker was alive at least at some point, and
therefore the extra test isn't necessary. We only need that test, if
I understand correctly, to cover the failure-to-launch case, which is
presumably very rare.

Maybe track "worker is known to have launched" in the leader's state
someplace?

regards, tom lane

#10Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#9)
Re: [HACKERS] parallel.c oblivion of worker-startup failures

On Wed, Dec 6, 2017 at 9:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Maybe track "worker is known to have launched" in the leader's state
someplace?

That's probably one piece of it, yes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#8)
Re: [HACKERS] parallel.c oblivion of worker-startup failures

On Wed, Dec 6, 2017 at 8:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Dec 6, 2017 at 1:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Looks good to me.

Committed and back-patched to 9.4 where dynamic background workers
were introduced.

Thanks.

Barring objections, I'll commit and
back-patch this; then we can deal with the other part of this
afterward.

Sure, I will rebase on top of the commit unless you have some comments
on the remaining part.

I'm not in love with that part of the fix; the other parts of that if
statement are just testing variables, and a function call that takes
and releases an LWLock is a lot more expensive. Furthermore, that
test will often be hit in practice, because we'll often arrive at this
function before workers have actually finished. On top of that, we'll
typically arrive here having already communicated with the worker in
some way, such as by receiving tuples, which means that in most cases
we already know that the worker was alive at least at some point, and
therefore the extra test isn't necessary. We only need that test, if
I understand correctly, to cover the failure-to-launch case, which is
presumably very rare.

All that having been said, I don't quite know how to do it better. It
would be easy enough to modify this function so that if it ever
received a result other than BGWH_NOT_YET_STARTED for a worker, it
wouldn't call this function again for that worker. However, that's
only mitigating the problem, not really fixing it. We'll still end up
frequently inquiring - admittedly only once - about the status of a
worker with which we've previously communicated via the tuple queue.
Maybe we just have to live with that problem, but do you (or does
anyone else) have an idea?

I think the optimization you are suggesting has merit over what I
have done in the patch. However, one point to note is that we are
anyway going to call that function down the path
(WaitForParallelWorkersToExit->WaitForBackgroundWorkerShutdown->GetBackgroundWorkerPid),
so we might want to take advantage of calling it the first time as
well. We can actually cache the status of workers that have returned
BGWH_STOPPED and use it later so that we don't need to make this
function call again for workers which are already stopped. We can
even do what Tom is suggesting to avoid calling it for workers which
are known to be launched, but I am really not sure if that function
call is costly enough that we need to maintain one extra state to
avoid that.
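
Amit's caching idea, in miniature: remember workers that have already
reported BGWH_STOPPED so that a later wait loop skips the status call
for them entirely. A standalone, illustrative sketch; note_status,
demo_probe_rounds, and the array are invented stand-ins for the
per-worker state parallel.c would keep:

```c
#include <assert.h>
#include <stdbool.h>

#define NWORKERS 4

typedef enum { BGWH_STARTED, BGWH_STOPPED } BgwHandleStatus;

static bool known_stopped[NWORKERS];	/* cached BGWH_STOPPED results */

/* Record a status probe's result; return true if a probe was needed. */
static bool
note_status(int i, BgwHandleStatus st)
{
	if (known_stopped[i])
		return false;			/* already known stopped: skip the call */
	if (st == BGWH_STOPPED)
		known_stopped[i] = true;
	return true;
}

static int
demo_probe_rounds(void)
{
	int			probes = 0;

	/* first pass: worker 1 is found stopped and gets cached */
	for (int i = 0; i < NWORKERS; i++)
		if (note_status(i, i == 1 ? BGWH_STOPPED : BGWH_STARTED))
			probes++;
	/* second pass: worker 1 is skipped thanks to the cache */
	for (int i = 0; i < NWORKERS; i++)
		if (note_status(i, BGWH_STARTED))
			probes++;
	return probes;				/* 4 probes, then only 3 */
}
```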

While looking into this code path, I wonder how much we are gaining by
having two separate calls (WaitForParallelWorkersToFinish and
WaitForParallelWorkersToExit) to check the status of workers after
finishing the parallel execution? They are used together in rescan
code path, so apparently there is no benefit calling them separately
there. OTOH, during shutdown of nodes, it will often be the case that
they will be called in a short duration after each other except for
the case where we need to shut down the node for the early exit like
when there is a limit clause or such.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#12Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#11)
Re: [HACKERS] parallel.c oblivion of worker-startup failures

On Fri, Dec 8, 2017 at 4:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the optimization you are suggesting has merit over what I
have done in the patch. However, one point to note is that we are
anyway going to call that function down the path
(WaitForParallelWorkersToExit->WaitForBackgroundWorkerShutdown->GetBackgroundWorkerPid),
so we might want to take advantage of calling it the first time as
well. We can actually cache the status of workers that have returned
BGWH_STOPPED and use it later so that we don't need to make this
function call again for workers which are already stopped. We can
even do what Tom is suggesting to avoid calling it for workers which
are known to be launched, but I am really not sure if that function
call is costly enough that we need to maintain one extra state to
avoid that.

Well, let's do what optimization we can without making it too complicated.

While looking into this code path, I wonder how much we are gaining by
having two separate calls (WaitForParallelWorkersToFinish and
WaitForParallelWorkersToExit) to check the status of workers after
finishing the parallel execution? They are used together in rescan
code path, so apparently there is no benefit calling them separately
there. OTOH, during shutdown of nodes, it will often be the case that
they will be called in a short duration after each other except for
the case where we need to shut down the node for the early exit like
when there is a limit clause or such.

I'm not 100% sure either, but if we're going to do anything about
that, it seems like a topic for another patch. I don't think it's
completely without value because there is some time after we
WaitForParallelWorkersToFinish and before we
WaitForParallelWorkersToExit during which we can do things like
retrieve instrumentation data and shut down other nodes, but whether
it pulls its weight in code, I don't know. This kind of grew up
gradually: originally I/we didn't think of all of the cases where we
needed the workers to actually exit rather than just indicating that
they were done generating tuples, and the current situation is the
result of a series of bug-fixes related to that oversight, so it's
quite possible that a redesign would make sense.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#12)
Re: [HACKERS] parallel.c oblivion of worker-startup failures

On Fri, Dec 8, 2017 at 9:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Dec 8, 2017 at 4:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the optimization you are suggesting has merit over what I
have done in the patch. However, one point to note is that we are
anyway going to call that function down the path
(WaitForParallelWorkersToExit->WaitForBackgroundWorkerShutdown->GetBackgroundWorkerPid),
so we might want to take advantage of calling it the first time as
well. We can actually cache the status of workers that have returned
BGWH_STOPPED and use it later so that we don't need to make this
function call again for workers which are already stopped. We can
even do what Tom is suggesting to avoid calling it for workers which
are known to be launched, but I am really not sure if that function
call is costly enough that we need to maintain one extra state to
avoid that.

Well, let's do what optimization we can without making it too complicated.

Okay, see the attached and let me know if that suffices?

While looking into this code path, I wonder how much we are gaining by
having two separate calls (WaitForParallelWorkersToFinish and
WaitForParallelWorkersToExit) to check the status of workers after
finishing the parallel execution? They are used together in rescan
code path, so apparently there is no benefit calling them separately
there. OTOH, during shutdown of nodes, it will often be the case that
they will be called in a short duration after each other except for
the case where we need to shut down the node for the early exit like
when there is a limit clause or such.

I'm not 100% sure either, but if we're going to do anything about
that, it seems like a topic for another patch.

Makes sense.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_parallel_worker_startup_failures_v1.patch (application/octet-stream, +37/-2)
#14Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#13)
Re: [HACKERS] parallel.c oblivion of worker-startup failures

On Sun, Dec 10, 2017 at 11:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, see the attached and let me know if that suffices?

+             * Check for unexpected worker death.  This will ensure that if
+             * the postmaster failed to start the worker, then we don't wait
+             * for it indefinitely.  For workers that are known to be
+             * launched, we can rely on their error queue being freed once
+             * they exit.

Hmm. Is this really true? What if the worker starts up but then
crashes before attaching to the error queue?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#14)
Re: [HACKERS] parallel.c oblivion of worker-startup failures

On Mon, Dec 11, 2017 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Dec 10, 2017 at 11:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, see the attached and let me know if that suffices?

+             * Check for unexpected worker death.  This will ensure that if
+             * the postmaster failed to start the worker, then we don't wait
+             * for it indefinitely.  For workers that are known to be
+             * launched, we can rely on their error queue being freed once
+             * they exit.

Hmm. Is this really true? What if the worker starts up but then
crashes before attaching to the error queue?

If the worker errored out before attaching to the error queue, then we
can't rely on the error queue being freed. However, in that case, the
worker status will be BGWH_STOPPED. I have adjusted the comment
accordingly.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_parallel_worker_startup_failures_v2.patch (application/octet-stream, +36/-2)
#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#15)
Re: [HACKERS] parallel.c oblivion of worker-startup failures

On Tue, Dec 12, 2017 at 9:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Dec 11, 2017 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Dec 10, 2017 at 11:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, see the attached and let me know if that suffices?

+             * Check for unexpected worker death.  This will ensure that if
+             * the postmaster failed to start the worker, then we don't wait
+             * for it indefinitely.  For workers that are known to be
+             * launched, we can rely on their error queue being freed once
+             * they exit.

Hmm. Is this really true? What if the worker starts up but then
crashes before attaching to the error queue?

If the worker errored out before attaching to the error queue, then we
can't rely on the error queue being freed. However, in that case, the
worker status will be BGWH_STOPPED. I have adjusted the comment
accordingly.

In particular, if the worker crashed (say somebody forcefully killed
the worker), then I think it will lead to a restart of the server. So
I am not sure that adding anything about that to the comment will make
much sense.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#17Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#16)
Re: [HACKERS] parallel.c oblivion of worker-startup failures

On Mon, Dec 11, 2017 at 10:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

In particular, if the worker crashed (say somebody forcefully killed
the worker), then I think it will lead to a restart of the server. So
I am not sure that adding anything about that to the comment will make
much sense.

I think it's dangerous to rely on that assumption. If somebody stuck
proc_exit(1) into the very beginning of the worker startup sequence,
the worker would exit but the server would not restart.

As I started experimenting with this, I discovered that your patch is
actually unsafe. Here's a test case that shows the breakage:

set force_parallel_mode = on;
select 1;

The expected outcome is:

?column?
----------
1
(1 row)

If I hack do_start_bgworker() to make worker startup fail, like this:

--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -5701,7 +5701,7 @@ do_start_bgworker(RegisteredBgWorker *rw)
 #ifdef EXEC_BACKEND
     switch ((worker_pid = bgworker_forkexec(rw->rw_shmem_slot)))
 #else
-    switch ((worker_pid = fork_process()))
+    switch ((worker_pid = -1))
 #endif
     {
         case -1:

...then it hangs forever. With your patch it no longer hangs forever,
which is good, but it also returns the wrong answer, which is bad:

?column?
----------
(0 rows)

The problem is that Gather sees nworkers_launched > 0 and assumes that
it is therefore entitled not to perform a local scan. I'm pretty sure
that there could also be cases where this patch makes us miss an error
reported by a worker; suppose the worker error is reported and the
worker exits after WaitForParallelWorkersToFinish does
CHECK_FOR_INTERRUPTS() but before we reach GetBackgroundWorkerPid().
The code will simply conclude that the worker is dead and stop
waiting, never mind the fact that there's a pending error in the
queue. This is exactly why I made parallel workers send a Terminate
message to the leader instead of just disappearing -- the leader needs
to see that message to know that it's reached the end of what the
worker is going to send.

So, I think we need to regard fork failure and related failure modes
as ERROR conditions. It's not equivalent to failing to register the
background worker. When we fail to register the background worker,
the parallel.c machinery knows at LaunchParallelWorkers() time that
the worker will never show up and lets the calling code know by setting
nworkers_launched, and the calling code can then choose to proceed
with that number of workers or ERROR out as it chooses. But once we
decide on the value of nworkers_launched to report to callers, it's
really not OK for workers to just silently not ever start, as the test
case above shows. We could make it the job of code that uses the
ParallelContext machinery to cope with that situation, but I think
that's a bad plan, because every user would have to worry about this
obscure case. Right now, the only in-core use of the ParallelContext
machinery is Gather/Gather Merge, but there are pending patches for
parallel index scan and parallel VACUUM, so we'll just end up adding
this handling to more and more places. And for what? If you've got
max_connections and/or max_parallel_workers set so high that your
system can't fork(), you have a big problem, and forcing things that
would have run using parallel query to run serially instead of
erroring out is not going to fix it. I think that deciding we're
going to handle fork() failure by reporting an ERROR is just fine.

So, here's a patch. This patch keeps track of whether we've ever
received a message from each worker via the error queue. If there are
no remaining workers which have neither exited cleanly nor sent us at
least one message, then it calls GetBackgroundWorkerPid() for each
one. If any worker reports that it is BGWH_STOPPED, then it
double-checks whether the worker has attached to the error queue. If
it did, then it assumes that we'll detect clean shutdown or an error
on the next iteration and ignores the problem. If not, then the
worker died without ever attaching to the error queue, so it throws an
ERROR. I tested that this causes the test case above to throw a
proper ERROR when fork_process() returns -1. Please review...
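
The decision logic Robert describes condenses to a small per-worker
classification: a worker that is BGWH_STOPPED, never attached to the
error queue, and never sent a message must have died before startup, so
the only safe response is an ERROR. A standalone, illustrative sketch
with invented names (classify_worker, LeaderAction), not the patch
itself:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { BGWH_STARTED, BGWH_NOT_YET_STARTED, BGWH_STOPPED } BgwHandleStatus;

typedef enum
{
	KEEP_WAITING,				/* worker may still report */
	EXPECT_SHUTDOWN_MESSAGE,	/* attached the queue: next loop sees exit/error */
	RAISE_ERROR					/* died without ever attaching the error queue */
} LeaderAction;

static LeaderAction
classify_worker(BgwHandleStatus status, bool attached_to_error_queue,
				bool ever_sent_message)
{
	if (ever_sent_message)
		return KEEP_WAITING;	/* normal protocol will finish the job */
	if (status != BGWH_STOPPED)
		return KEEP_WAITING;	/* not started yet, or still running */
	if (attached_to_error_queue)
		return EXPECT_SHUTDOWN_MESSAGE;
	return RAISE_ERROR;			/* fork failure or very early death */
}
```

The attached-queue branch is what avoids racing against a worker that
started normally and is about to report an error or clean shutdown.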

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

parallel-worker-fork-failure.patch (application/octet-stream, +83/-2)
#18Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#17)
Re: [HACKERS] parallel.c oblivion of worker-startup failures

On Wed, Dec 13, 2017 at 12:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Dec 11, 2017 at 10:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

In particular, if the worker crashed (say somebody forcefully killed
the worker), then I think it will lead to a restart of the server. So
not sure, adding anything about that in the comment will make much
sense.

The problem is that Gather sees nworkers_launched > 0 and assumes that
it is therefore entitled not to perform a local scan.

Yeah, I think that assumption is not right. Ideally, before assuming
that, it should wait for the workers to start up.

I'm pretty sure
that there could also be cases where this patch makes us miss an error
reported by a worker; suppose the worker error is reported and the
worker exits after WaitForParallelWorkersToFinish does
CHECK_FOR_INTERRUPTS() but before we reach GetBackgroundWorkerPid().
The code will simply conclude that the worker is dead and stop
waiting, never mind the fact that there's a pending error in the
queue.

I have tried to reproduce this situation by adding an error case in
the worker (just before the worker returns success, the 'X' message)
and by setting a breakpoint in WaitForParallelWorkersToFinish. What I
noticed is that the worker is not allowed to exit till it receives
some ack from the master (basically, the worker waits at
SendProcSignal after sending the message), and the error is reported
properly.

This is exactly why I made parallel workers send a Terminate
message to the leader instead of just disappearing -- the leader needs
to see that message to know that it's reached the end of what the
worker is going to send.

So, I think we need to regard fork failure and related failure modes
as ERROR conditions. It's not equivalent to failing to register the
background worker. When we fail to register the background worker,
the parallel.c machinery knows at LaunchParallelWorkers() time that
the worker will never show up and lets the calling code know by setting
nworkers_launched, and the calling code can then choose to proceed
with that number of workers or ERROR out as it chooses. But once we
decide on the value of nworkers_launched to report to callers, it's
really not OK for workers to just silently not ever start, as the test
case above shows. We could make it the job of code that uses the
ParallelContext machinery to cope with that situation, but I think
that's a bad plan, because every user would have to worry about this
obscure case. Right now, the only in-core use of the ParallelContext
machinery is Gather/Gather Merge, but there are pending patches for
parallel index scan and parallel VACUUM, so we'll just end up adding
this handling to more and more places. And for what? If you've got
max_connections and/or max_parallel_workers set so high that your
system can't fork(), you have a big problem, and forcing things that
would have run using parallel query to run serially instead of
erroring out is not going to fix it. I think that deciding we're
going to handle fork() failure by reporting an ERROR is just fine.

So, if I understand correctly, this means that it will return an error
if there is a problem in even one worker (exited or not started), even
though the other workers would have easily finished the work. Won't
such an error lead to situations where, after running the query for an
hour, the user might see a failure just because one of the many
workers was not started (or some similar failure), even though the
query would have completed without it? If so, that doesn't sound like
sensible behavior.

So, here's a patch. This patch keeps track of whether we've ever
received a message from each worker via the error queue. If there are
any remaining workers which have neither exited cleanly nor sent us at
least one message, then it calls GetBackgroundWorkerPid() for each
such worker. If any worker reports that it is BGWH_STOPPED, then it
double-checks whether the worker has attached to the error queue. If
it did, then it assumes that we'll detect clean shutdown or an error
on the next iteration and ignores the problem. If not, then the
worker died without ever attaching to the error queue, so it throws an
ERROR. I tested that this causes the test case above to throw a
proper ERROR when fork_process() returns -1. Please review...
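A rough toy model of the check the patch describes (invented names; not
the actual patch code, which works with real BackgroundWorkerHandles and
shm_mq error queues):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model with invented names, illustrating the check described
 * above; the real patch operates on BackgroundWorkerHandle and the
 * per-worker error queues. */
typedef enum { TOY_BGWH_STARTED, TOY_BGWH_STOPPED } ToyWorkerStatus;

typedef struct
{
    bool known_finished;          /* already exited cleanly */
    bool any_message_received;    /* ever got a message on its error queue */
    bool attached_to_error_queue; /* worker made it far enough to attach */
    ToyWorkerStatus status;       /* what GetBackgroundWorkerPid() reports */
} ToyWorker;

/* Returns true if the leader must raise an ERROR: some worker is
 * already stopped but never attached to its error queue, so no error
 * message will ever arrive from it. */
static bool
worker_died_before_attaching(const ToyWorker *w, int n)
{
    for (int i = 0; i < n; i++)
    {
        if (w[i].known_finished || w[i].any_message_received)
            continue;           /* we'll hear from this one normally */
        if (w[i].status == TOY_BGWH_STOPPED && !w[i].attached_to_error_queue)
            return true;        /* fork failed, or died before attaching */
        /* If it attached, assume the next iteration will see a clean
         * shutdown or an error on the queue, and ignore it for now. */
    }
    return false;
}
```

Note that the last branch is exactly the case Amit pokes at below: a
worker that attaches and then dies without sending anything is treated
as "we'll hear from it later", which may never happen.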

This also doesn't appear to be completely safe. If we add
proc_exit(1) after attaching to error queue (say after
pq_set_parallel_master) in the worker, then it will lead to *hang* as
anyone_alive will still be false and as it will find that the sender
is set for the error queue, it won't return any failure. Now, I think
even if we check worker status (which will be stopped) and break after
the new error condition, it won't work as it will still return zero
rows in the case reported by you above.

I think that if, before assuming we need not scan locally, we add a
check to see whether the workers have actually started, then my patch
should work fine and we don't need any additional checks. Also, it won't
create any performance issue as we will perform such a check only when
force_parallel_mode is not off or parallel leader participation is off
which I think are not common cases.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#19Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#18)
Re: [HACKERS] parallel.c oblivion of worker-startup failures

On Wed, Dec 13, 2017 at 1:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have tried to reproduce this situation by adding error case in
worker (just before worker returns success ('X' message)) and by
having a breakpoint in WaitForParallelWorkersToFinish. What I noticed
is that worker is not allowed to exit till it receives some ack from
master (basically worker waits at SendProcSignal after sending
message) and error is reported properly.

SendProcSignal doesn't look to me like it can do anything that can
block, so I don't understand this.

So, if I understand correctly, this means that it will return an error
even if there is a problem in one worker (exited or not started) even
though other workers would have easily finished the work. Won't such
an error lead to situations where, after running the query for one
hour, the user might see some failure just because one of the many
workers is not started (or some similar failure) even though the query
would have been completed without that? If so, that doesn't sound
like a sensible behavior.

I think it's the *only* sensible behavior. parallel.c does not know
that the query would have completed successfully without that worker,
or at least it doesn't know that it would have returned the right
answer. It *might* be the case that the caller wasn't depending on
that worker to do anything, but parallel.c doesn't know that.

This also doesn't appear to be completely safe. If we add
proc_exit(1) after attaching to error queue (say after
pq_set_parallel_master) in the worker, then it will lead to a *hang* as
anyone_alive will still be false and as it will find that the sender
is set for the error queue, it won't return any failure. Now, I think
even if we check worker status (which will be stopped) and break after
the new error condition, it won't work as it will still return zero
rows in the case reported by you above.

Hmm, there might still be a problem there. I was thinking that once
the worker attaches to the queue, we can rely on the leader reaching
"ERROR: lost connection to parallel worker" in HandleParallelMessages.
However, that might not work because nothing sets
ParallelMessagePending in that case. The worker will have detached
the queue via shm_mq_detach_callback, but the leader will only
discover the detach if it actually tries to read from that queue.
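The gap Robert describes can be shown with a small toy illustration
(invented names; the real mechanism involves shm_mq_detach_callback and
the ParallelMessagePending flag): detaching from a shared queue does
not, by itself, prod the reader -- the reader only learns of the detach
when it next tries to read.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy illustration, invented names.  The real analogue of
 * message_pending is ParallelMessagePending; the real analogue of
 * worker_exit_detach() is shm_mq_detach_callback running at worker
 * exit. */
typedef struct
{
    bool sender_detached;   /* set by the worker's on-exit callback */
    bool message_pending;   /* analogue of ParallelMessagePending */
} ToyQueue;

static void
worker_exit_detach(ToyQueue *q)
{
    q->sender_detached = true;
    /* Crucially, nothing here sets q->message_pending -- this is the
     * gap under discussion: the leader is never prodded to look. */
}

typedef enum { READ_WOULD_BLOCK, READ_DETACHED } ToyReadResult;

static ToyReadResult
try_read(const ToyQueue *q)
{
    /* Only an explicit read attempt discovers the detach. */
    return q->sender_detached ? READ_DETACHED : READ_WOULD_BLOCK;
}
```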

I think that if, before assuming we need not scan locally, we add a
check to see whether the workers have actually started, then my patch
should work fine and we don't need any additional checks. Also, it won't
create any performance issue as we will perform such a check only when
force_parallel_mode is not off or parallel leader participation is off
which I think are not common cases.

My instinct is that we're going to have a lot of subtle bugs if
clients of parallel.c can't count on nworkers_launched to be accurate.
If clients didn't need that value, we could just have nworkers and let
the callers figure it out, but nobody wants that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#19)
Re: [HACKERS] parallel.c oblivion of worker-startup failures

On Thu, Dec 14, 2017 at 3:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Dec 13, 2017 at 1:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

This also doesn't appear to be completely safe. If we add
proc_exit(1) after attaching to error queue (say after
pq_set_parallel_master) in the worker, then it will lead to a *hang* as
anyone_alive will still be false and as it will find that the sender
is set for the error queue, it won't return any failure. Now, I think
even if we check worker status (which will be stopped) and break after
the new error condition, it won't work as it will still return zero
rows in the case reported by you above.

Hmm, there might still be a problem there. I was thinking that once
the worker attaches to the queue, we can rely on the leader reaching
"ERROR: lost connection to parallel worker" in HandleParallelMessages.
However, that might not work because nothing sets
ParallelMessagePending in that case. The worker will have detached
the queue via shm_mq_detach_callback, but the leader will only
discover the detach if it actually tries to read from that queue.

I think it would be much easier to fix this problem if we had some
way to differentiate whether the worker has stopped
gracefully or not. Do you think it makes sense to introduce such a
state in the background worker machinery?
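Such a state might look something like the sketch below. To be clear,
this is purely hypothetical: PostgreSQL's real BgwHandleStatus (in
bgworker.h) has no graceful/ungraceful distinction as of this thread,
and the names and classification function here are invented.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch only -- the *_GRACEFUL value does not exist in
 * the real BgwHandleStatus enum; all names are invented. */
typedef enum
{
    TOY_BGWH_STARTED,
    TOY_BGWH_STOPPED,           /* stopped without a clean exit, or
                                 * never started (e.g. fork() failed) */
    TOY_BGWH_STOPPED_GRACEFUL   /* started, then exited with status 0 */
} ToyBgwHandleStatus;

/* How the postmaster might classify a stopped worker, given whether it
 * ever started and its exit status. */
static ToyBgwHandleStatus
classify_stop(bool ever_started, int exit_status)
{
    if (!ever_started)
        return TOY_BGWH_STOPPED;
    return (exit_status == 0) ? TOY_BGWH_STOPPED_GRACEFUL : TOY_BGWH_STOPPED;
}
```

With a distinction like this, the leader could error out on
TOY_BGWH_STOPPED without having to guess whether the worker attached to
its error queue first.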

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
