strange parallel query behavior after OOM crashes

Started by Tomas Vondra, about 9 years ago, 39 messages, pgsql-hackers
#1 Tomas Vondra
tomas.vondra@2ndquadrant.com

Hi,

While doing some benchmarking, I've run into a fairly strange issue with
OOM breaking LaunchParallelWorkers() after the restart. What I see
happening is this:

1) a query is executed, and at the end of LaunchParallelWorkers we get

nworkers=8 nworkers_launched=8

2) the query does a Hash Aggregate, but ends up eating much more memory
due to n_distinct underestimate (see [1] from 2015 for details), and
gets killed by OOM

3) the server restarts, the query is executed again, but this time we
get in LaunchParallelWorkers

nworkers=8 nworkers_launched=0

There's nothing else running on the server, and there definitely should
be free parallel workers.

4) The query gets killed again, and on the next execution we get

nworkers=8 nworkers_launched=8

again, although not always. I wonder whether the exact impact depends on
OOM killing the leader or worker, for example.

regards

[1]: /messages/by-id/CAFWGqnsxryEevA5A_CqT3dExmTaT44mBpNTy8TWVsSVDS71QMg@mail.gmail.com

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#1)
Re: strange parallel query behavior after OOM crashes

On Fri, Mar 31, 2017 at 7:38 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

While doing some benchmarking, I've ran into a fairly strange issue with OOM
breaking LaunchParallelWorkers() after the restart. What I see happening is
this:

1) a query is executed, and at the end of LaunchParallelWorkers we get

nworkers=8 nworkers_launched=8

2) the query does a Hash Aggregate, but ends up eating much more memory due
to n_distinct underestimate (see [1] from 2015 for details), and gets killed
by OOM

3) the server restarts, the query is executed again, but this time we get in
LaunchParallelWorkers

nworkers=8 nworkers_launched=0

There's nothing else running on the server, and there definitely should be
free parallel workers.

4) The query gets killed again, and on the next execution we get

nworkers=8 nworkers_launched=8

again, although not always. I wonder whether the exact impact depends on OOM
killing the leader or worker, for example.

I don't know what's going on but I think I have seen this once or
twice myself while hacking on test code that crashed. I wonder if the
DSM_CREATE_NULL_IF_MAXSEGMENTS case could be being triggered because
the DSM control is somehow confused?

--
Thomas Munro
http://www.enterprisedb.com


#3 Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Thomas Munro (#2)
Re: strange parallel query behavior after OOM crashes

On Fri, Mar 31, 2017 at 12:32 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Fri, Mar 31, 2017 at 7:38 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

While doing some benchmarking, I've ran into a fairly strange issue with OOM
breaking LaunchParallelWorkers() after the restart. What I see happening is
this:

1) a query is executed, and at the end of LaunchParallelWorkers we get

nworkers=8 nworkers_launched=8

2) the query does a Hash Aggregate, but ends up eating much more memory due
to n_distinct underestimate (see [1] from 2015 for details), and gets killed
by OOM

3) the server restarts, the query is executed again, but this time we get in
LaunchParallelWorkers

nworkers=8 nworkers_launched=0

There's nothing else running on the server, and there definitely should be
free parallel workers.

4) The query gets killed again, and on the next execution we get

nworkers=8 nworkers_launched=8

again, although not always. I wonder whether the exact impact depends on OOM
killing the leader or worker, for example.

I don't know what's going on but I think I have seen this once or
twice myself while hacking on test code that crashed. I wonder if the
DSM_CREATE_NULL_IF_MAXSEGMENTS case could be being triggered because
the DSM control is somehow confused?

I think I've run into the same problem while working on parallelizing
plans containing InitPlans. You can reproduce that scenario by
following steps:

1. Put an Assert(0) in ParallelQueryMain(), start the server and execute
any parallel query.
In LaunchParallelWorkers, you can see
nworkers = n, nworkers_launched = n (n > 0)
But all the workers will crash because of the assert statement.
2. The server restarts automatically and reinitializes
BackgroundWorkerData->parallel_register_count and
BackgroundWorkerData->parallel_terminate_count in shared memory.
After that, it calls ForgetBackgroundWorker, which increments
parallel_terminate_count. In LaunchParallelWorkers, we have the
following condition:
if ((BackgroundWorkerData->parallel_register_count -
     BackgroundWorkerData->parallel_terminate_count) >=
    max_parallel_workers)
        DO NOT launch any parallel worker.
Hence, nworkers = n, nworkers_launched = 0.

I had thought the parallel workers were crashing because of my own
mistake, so I assumed this was supposed to happen. Sorry for that.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


#4 Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Kuntal Ghosh (#3)
Re: strange parallel query behavior after OOM crashes

On Fri, Mar 31, 2017 at 2:05 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

1. Put an Assert(0) in ParallelQueryMain(), start server and execute
any parallel query.
In LaunchParallelWorkers, you can see
nworkers = n nworkers_launched = n (n>0)
But, all the workers will crash because of the assert statement.
2. the server restarts automatically, initialize
BackgroundWorkerData->parallel_register_count and
BackgroundWorkerData->parallel_terminate_count in the shared memory.
After that, it calls ForgetBackgroundWorker and it increments
parallel_terminate_count. In LaunchParallelWorkers, we have the
following condition:
if ((BackgroundWorkerData->parallel_register_count -
BackgroundWorkerData->parallel_terminate_count) >=
max_parallel_workers)
DO NOT launch any parallel worker.
Hence, nworkers = n nworkers_launched = 0.

parallel_register_count and parallel_terminate_count are both unsigned
integers, so whenever the true difference would be negative, the
subtraction wraps around to a well-defined unsigned value that is
certainly much larger than max_parallel_workers. Hence, no workers will
be launched. I've attached a patch to fix this.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix-workers-launched.patch (+1, -1)
#5 Neha Khatri
nehakhatri5@gmail.com
In reply to: Kuntal Ghosh (#4)
Re: strange parallel query behavior after OOM crashes

On Fri, Mar 31, 2017 at 8:29 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com>
wrote:

On Fri, Mar 31, 2017 at 2:05 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

1. Put an Assert(0) in ParallelQueryMain(), start server and execute
any parallel query.
In LaunchParallelWorkers, you can see
nworkers = n nworkers_launched = n (n>0)
But, all the workers will crash because of the assert statement.
2. the server restarts automatically, initialize
BackgroundWorkerData->parallel_register_count and
BackgroundWorkerData->parallel_terminate_count in the shared memory.
After that, it calls ForgetBackgroundWorker and it increments
parallel_terminate_count. In LaunchParallelWorkers, we have the
following condition:
if ((BackgroundWorkerData->parallel_register_count -
BackgroundWorkerData->parallel_terminate_count) >=
max_parallel_workers)
DO NOT launch any parallel worker.
Hence, nworkers = n nworkers_launched = 0.

parallel_register_count and parallel_terminate_count, both are
unsigned integer. So, whenever the difference is negative, it'll be a
well-defined unsigned integer and certainly much larger than
max_parallel_workers. Hence, no workers will be launched. I've
attached a patch to fix this.

The current explanation of the active number of parallel workers is:

* The active
* number of parallel workers is the number of registered workers minus the
* terminated ones.

In situations like the one you mentioned above, this formula can give a
negative number of active parallel workers, but a negative number of
active parallel workers does not make any sense.

I feel it would be better to explain in the code in which situations the
formula can generate a negative result and what that means.

Regards,
Neha

#6 Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Neha Khatri (#5)
Re: strange parallel query behavior after OOM crashes

On Fri, Mar 31, 2017 at 5:43 AM, Neha Khatri <nehakhatri5@gmail.com> wrote:

On Fri, Mar 31, 2017 at 8:29 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com>
wrote:

On Fri, Mar 31, 2017 at 2:05 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

1. Put an Assert(0) in ParallelQueryMain(), start server and execute
any parallel query.
In LaunchParallelWorkers, you can see
nworkers = n nworkers_launched = n (n>0)
But, all the workers will crash because of the assert statement.
2. the server restarts automatically, initialize
BackgroundWorkerData->parallel_register_count and
BackgroundWorkerData->parallel_terminate_count in the shared memory.
After that, it calls ForgetBackgroundWorker and it increments
parallel_terminate_count. In LaunchParallelWorkers, we have the
following condition:
if ((BackgroundWorkerData->parallel_register_count -
BackgroundWorkerData->parallel_terminate_count) >=
max_parallel_workers)
DO NOT launch any parallel worker.
Hence, nworkers = n nworkers_launched = 0.

parallel_register_count and parallel_terminate_count, both are
unsigned integer. So, whenever the difference is negative, it'll be a
well-defined unsigned integer and certainly much larger than
max_parallel_workers. Hence, no workers will be launched. I've
attached a patch to fix this.

The current explanation of active number of parallel workers is:

* The active
* number of parallel workers is the number of registered workers minus the
* terminated ones.

In the situations like you mentioned above, this formula can give negative
number for active parallel workers. However a negative number for active
parallel workers does not make any sense.

Agreed.

I feel it would be better to explain in code that in what situations, the
formula
can generate a negative result and what that means.

I think we need to find a fix so that it never generates a negative
result. The last patch I submitted computes the negative value
correctly, but surely that's not enough.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


#7 Robert Haas
robertmhaas@gmail.com
In reply to: Kuntal Ghosh (#3)
Re: strange parallel query behavior after OOM crashes

On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

2. the server restarts automatically, initialize
BackgroundWorkerData->parallel_register_count and
BackgroundWorkerData->parallel_terminate_count in the shared memory.
After that, it calls ForgetBackgroundWorker and it increments
parallel_terminate_count.

Hmm. So this seems like the root of the problem. Presumably those
things need to be reset AFTER forgetting any background workers from
before the crash.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#8 Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Robert Haas (#7)
Re: strange parallel query behavior after OOM crashes

On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

2. the server restarts automatically, initialize
BackgroundWorkerData->parallel_register_count and
BackgroundWorkerData->parallel_terminate_count in the shared memory.
After that, it calls ForgetBackgroundWorker and it increments
parallel_terminate_count.

Hmm. So this seems like the root of the problem. Presumably those
things need to be reset AFTER forgetting any background workers from
before the crash.

IMHO, the fix would be not to increment the terminated-parallel-worker
count when ForgetBackgroundWorker is called due to a bgworker crash.
I've attached a patch for the same.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Fix-parallel-worker-counts-after-a-crash.patch (+14, -2)
#9 Neha Khatri
nehakhatri5@gmail.com
In reply to: Kuntal Ghosh (#8)
Re: strange parallel query behavior after OOM crashes

Looking further in this context, the number of active parallel workers is:
parallel_register_count - parallel_terminate_count

Can the number of active workers ever be greater than
max_parallel_workers? I think not. Then why should there be a
greater-than check in the following condition:

if (parallel && (BackgroundWorkerData->parallel_register_count -
BackgroundWorkerData->parallel_terminate_count) >= max_parallel_workers)

I feel there should be an assert if
(BackgroundWorkerData->parallel_register_count
- BackgroundWorkerData->parallel_terminate_count) > max_parallel_workers,

and the check could be:
if (parallel && (active_parallel_workers == max_parallel_workers))
then do not register new parallel workers and return.

There should be no tolerance for the case where active_parallel_workers >
max_parallel_workers; after all, that is the purpose of
max_parallel_workers.

Or is it that multiple backends may try to register parallel workers at
the same time, and that is why the greater-than check is present?

Thoughts?

Regards,
Neha

#10 Robert Haas
robertmhaas@gmail.com
In reply to: Kuntal Ghosh (#8)
Re: strange parallel query behavior after OOM crashes

On Mon, Apr 3, 2017 at 6:08 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

2. the server restarts automatically, initialize
BackgroundWorkerData->parallel_register_count and
BackgroundWorkerData->parallel_terminate_count in the shared memory.
After that, it calls ForgetBackgroundWorker and it increments
parallel_terminate_count.

Hmm. So this seems like the root of the problem. Presumably those
things need to be reset AFTER forgetting any background workers from
before the crash.

IMHO, the fix would be not to increase the terminated parallel worker
count whenever ForgetBackgroundWorker is called due to a bgworker
crash. I've attached a patch for the same. PFA.

While I'm not opposed to that approach, I don't think this is a good
way to implement it. If you want to pass an explicit flag to
ForgetBackgroundWorker telling it whether or not it should perform
the increment, fine. But with what you've got here, you're
essentially relying on "spooky action at a distance". It would be
easy for future code changes to break this without realizing that
somebody has a hard dependency on 0 having a specific meaning.

BTW, if this isn't on the open items list, it should be. It's
presumably the fault of b460f5d6693103076dc554aa7cbb96e1e53074f9.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#11 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#10)
Re: strange parallel query behavior after OOM crashes

On 04/04/2017 06:52 PM, Robert Haas wrote:

On Mon, Apr 3, 2017 at 6:08 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

2. the server restarts automatically, initialize
BackgroundWorkerData->parallel_register_count and
BackgroundWorkerData->parallel_terminate_count in the shared memory.
After that, it calls ForgetBackgroundWorker and it increments
parallel_terminate_count.

Hmm. So this seems like the root of the problem. Presumably those
things need to be reset AFTER forgetting any background workers from
before the crash.

IMHO, the fix would be not to increase the terminated parallel worker
count whenever ForgetBackgroundWorker is called due to a bgworker
crash. I've attached a patch for the same. PFA.

While I'm not opposed to that approach, I don't think this is a good
way to implement it. If you want to pass an explicit flag to
ForgetBackgroundWorker telling it whether or not it should performing
the increment, fine. But with what you've got here, you're
essentially relying on "spooky action at a distance". It would be
easy for future code changes to break this, not realizing that
somebody's got a hard dependency on 0 having a specific meaning.

I'm probably missing something, but I don't quite understand how these
values actually survive the crash. I mean, what I observed is OOM
followed by a restart, so shouldn't BackgroundWorkerShmemInit() simply
reset the values back to 0? Or do we call ForgetBackgroundWorker() after
the crash for some reason?

In any case, the comment right before BackgroundWorkerArray says this:

* These counters can of course overflow, but it's not important here
* since the subtraction will still give the right number.

which means that this assert

+	Assert(BackgroundWorkerData->parallel_register_count >=
+		BackgroundWorkerData->parallel_terminate_count);

is outright broken, just like any other attempts to rely on simple
comparisons of these two counters, no?

BTW, if this isn't on the open items list, it should be. It's
presumably the fault of b460f5d6693103076dc554aa7cbb96e1e53074f9.

Unrelated, but perhaps we should also add this to open items:

/messages/by-id/57bbc52c-5d40-bb80-2992-7e1027d0b67c@2ndquadrant.com

(although that's more a 9.6 material).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#12 Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Tomas Vondra (#11)
Re: strange parallel query behavior after OOM crashes

On Tue, Apr 4, 2017 at 11:22 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 04/04/2017 06:52 PM, Robert Haas wrote:

On Mon, Apr 3, 2017 at 6:08 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com>
wrote:

On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

2. the server restarts automatically, initialize
BackgroundWorkerData->parallel_register_count and
BackgroundWorkerData->parallel_terminate_count in the shared memory.
After that, it calls ForgetBackgroundWorker and it increments
parallel_terminate_count.

Hmm. So this seems like the root of the problem. Presumably those
things need to be reset AFTER forgetting any background workers from
before the crash.

IMHO, the fix would be not to increase the terminated parallel worker
count whenever ForgetBackgroundWorker is called due to a bgworker
crash. I've attached a patch for the same. PFA.

While I'm not opposed to that approach, I don't think this is a good
way to implement it. If you want to pass an explicit flag to
ForgetBackgroundWorker telling it whether or not it should performing
the increment, fine. But with what you've got here, you're
essentially relying on "spooky action at a distance". It would be
easy for future code changes to break this, not realizing that
somebody's got a hard dependency on 0 having a specific meaning.

I'm probably missing something, but I don't quite understand how these
values actually survive the crash. I mean, what I observed is OOM followed
by a restart, so shouldn't BackgroundWorkerShmemInit() simply reset the
values back to 0? Or do we call ForgetBackgroundWorker() after the crash for
some reason?

AFAICU, during crash recovery we wait for all non-syslogger children
to exit, then reset shared memory (calling BackgroundWorkerShmemInit)
and perform StartupDataBase. While starting the startup process, we
check whether any bgworker is scheduled for a restart. If a bgworker
has crashed and is not meant to be restarted (the parallel-worker case
here), we call ForgetBackgroundWorker() to discard it.

In any case, the comment right before BackgroundWorkerArray says this:

* These counters can of course overflow, but it's not important here
* since the subtraction will still give the right number.

which means that this assert

+       Assert(BackgroundWorkerData->parallel_register_count >=
+               BackgroundWorkerData->parallel_terminate_count);

is outright broken, just like any other attempts to rely on simple
comparisons of these two counters, no?

IIUC, the assert statement ensures that the register count is always
greater than or equal to the terminate count, whereas the comment
refers to the fact that the register count and terminate count
indicate the total number of parallel workers spawned and terminated,
respectively, since the server was started. We never reset the counts,
so they can overflow; but their subtraction should still give the
number of active parallel workers at any instant.
So I'm not able to see how the assert is broken w.r.t. the aforesaid
comment. Am I missing something here?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


#13 Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Neha Khatri (#9)
Re: strange parallel query behavior after OOM crashes

On Tue, Apr 4, 2017 at 12:16 PM, Neha Khatri <nehakhatri5@gmail.com> wrote:

I feel there should be an assert if
(BackgroundWorkerData->parallel_register_count -
BackgroundWorkerData->parallel_terminate_count) > max_parallel_workers)

Backend 1 > SET max_parallel_workers = 8;
Backend 1 > Execute a long-running parallel query q1 where the number of
parallel workers spawned is, say, 4.
Backend 2 > SET max_parallel_workers = 3;
Backend 2 > Execute a long-running parallel query q2 with the number of
parallel workers spawned > 0.

The above assert statement would bring down the server unnecessarily
while executing q2.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


#14 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kuntal Ghosh (#12)
Re: strange parallel query behavior after OOM crashes

On 04/05/2017 09:05 AM, Kuntal Ghosh wrote:

On Tue, Apr 4, 2017 at 11:22 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 04/04/2017 06:52 PM, Robert Haas wrote:

On Mon, Apr 3, 2017 at 6:08 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com>
wrote:

On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

2. the server restarts automatically, initialize
BackgroundWorkerData->parallel_register_count and
BackgroundWorkerData->parallel_terminate_count in the shared memory.
After that, it calls ForgetBackgroundWorker and it increments
parallel_terminate_count.

Hmm. So this seems like the root of the problem. Presumably those
things need to be reset AFTER forgetting any background workers from
before the crash.

IMHO, the fix would be not to increase the terminated parallel worker
count whenever ForgetBackgroundWorker is called due to a bgworker
crash. I've attached a patch for the same. PFA.

While I'm not opposed to that approach, I don't think this is a good
way to implement it. If you want to pass an explicit flag to
ForgetBackgroundWorker telling it whether or not it should performing
the increment, fine. But with what you've got here, you're
essentially relying on "spooky action at a distance". It would be
easy for future code changes to break this, not realizing that
somebody's got a hard dependency on 0 having a specific meaning.

I'm probably missing something, but I don't quite understand how these
values actually survive the crash. I mean, what I observed is OOM followed
by a restart, so shouldn't BackgroundWorkerShmemInit() simply reset the
values back to 0? Or do we call ForgetBackgroundWorker() after the crash for
some reason?

AFAICU, during crash recovery, we wait for all non-syslogger children
to exit, then reset shmem(call BackgroundWorkerShmemInit) and perform
StartupDataBase. While starting the startup process we check if any
bgworker is scheduled for a restart. If a bgworker is crashed and not
meant for a restart(parallel worker in our case), we call
ForgetBackgroundWorker() to discard it.

OK, so essentially we reset the counters, getting

parallel_register_count = 0
parallel_terminate_count = 0

and then ForgetBackgroundWorker() runs and increments the
terminate_count, breaking the logic?

And you're using (parallel_register_count > 0) to detect whether we're
still in the init phase or not. Hmmm.

In any case, the comment right before BackgroundWorkerArray says this:

* These counters can of course overflow, but it's not important here
* since the subtraction will still give the right number.

which means that this assert

+       Assert(BackgroundWorkerData->parallel_register_count >=
+               BackgroundWorkerData->parallel_terminate_count);

is outright broken, just like any other attempts to rely on simple
comparisons of these two counters, no?

IIUC, the assert statements ensures that register count should always
be greater than or equal to the terminate count.
Whereas, the comment refers to the fact that register count and
terminate count indicate the total number of parallel workers spawned
and terminated respectively since the server has been started. At
every session, we're not resetting the counts, hence they can
overflow. But, their substraction should give you the number of active
parallel worker at any instance.
So, I'm not able to see how the assert is broken w.r.t the aforesaid
comment. Am I missing something here?

The comment says that the counters are allowed to overflow, i.e. after a
long uptime you might get these values

parallel_register_count = UINT_MAX + 1 = 0 (wrapped around)
parallel_terminate_count = UINT_MAX

which is fine, because C handles the overflow during subtraction, and so

parallel_register_count - parallel_terminate_count = 1

But the assert is not doing subtraction, it's comparing the values
directly:

Assert(parallel_register_count >= parallel_terminate_count);

and the (perfectly valid) values trivially violate this comparison.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#15 Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Tomas Vondra (#14)
Re: strange parallel query behavior after OOM crashes

On Wed, Apr 5, 2017 at 3:07 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 04/05/2017 09:05 AM, Kuntal Ghosh wrote:

AFAICU, during crash recovery, we wait for all non-syslogger children
to exit, then reset shmem(call BackgroundWorkerShmemInit) and perform
StartupDataBase. While starting the startup process we check if any
bgworker is scheduled for a restart. If a bgworker is crashed and not
meant for a restart(parallel worker in our case), we call
ForgetBackgroundWorker() to discard it.

OK, so essentially we reset the counters, getting

parallel_register_count = 0
parallel_terminate_count = 0

and then ForgetBackgroundWorker() runs and increments the terminate_count,
breaking the logic?

And you're using (parallel_register_count > 0) to detect whether we're still
in the init phase or not. Hmmm.

Yes. But, as Robert suggested upthread, we should not use
(parallel_register_count == 0) as a proxy for detecting a bgworker
crash. Hence, I've added an argument named 'wasCrashed' to
ForgetBackgroundWorker to indicate a bgworker crash.

In any case, the comment right before BackgroundWorkerArray says this:

* These counters can of course overflow, but it's not important here
* since the subtraction will still give the right number.

which means that this assert

+       Assert(BackgroundWorkerData->parallel_register_count >=
+               BackgroundWorkerData->parallel_terminate_count);

is outright broken, just like any other attempts to rely on simple
comparisons of these two counters, no?

IIUC, the assert statements ensures that register count should always
be greater than or equal to the terminate count.
Whereas, the comment refers to the fact that register count and
terminate count indicate the total number of parallel workers spawned
and terminated respectively since the server has been started. At
every session, we're not resetting the counts, hence they can
overflow. But, their substraction should give you the number of active
parallel worker at any instance.
So, I'm not able to see how the assert is broken w.r.t the aforesaid
comment. Am I missing something here?

The comment says that the counters are allowed to overflow, i.e. after a
long uptime you might get these values

parallel_register_count = UINT_MAX + 1 = 0 (wrapped around)
parallel_terminate_count = UINT_MAX

which is fine, because C handles the overflow during subtraction, and so

parallel_register_count - parallel_terminate_count = 1

But the assert is not doing subtraction, it's comparing the values directly:

Assert(parallel_register_count >= parallel_terminate_count);

and the (perfectly valid) values trivially violate this comparison.

Thanks for the explanation. So we can't use the above assert
statement. Even the following assert statement would not be helpful:
Assert(parallel_register_count - parallel_terminate_count >= 0);
because it fails to catch the scenario where parallel_register_count
has not overflowed but is still less than parallel_terminate_count. :(

It seems to me the only thing we can be sure of is
(parallel_register_count == parallel_terminate_count == 0) in
ForgetBackgroundWorker in the case of a crash.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


#16 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kuntal Ghosh (#15)
Re: strange parallel query behavior after OOM crashes

On 04/05/2017 12:36 PM, Kuntal Ghosh wrote:

On Wed, Apr 5, 2017 at 3:07 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 04/05/2017 09:05 AM, Kuntal Ghosh wrote:

AFAICU, during crash recovery, we wait for all non-syslogger children
to exit, then reset shmem (call BackgroundWorkerShmemInit) and perform
StartupDataBase. While starting the startup process we check if any
bgworker is scheduled for a restart. If a bgworker is crashed and not
meant for a restart(parallel worker in our case), we call
ForgetBackgroundWorker() to discard it.

OK, so essentially we reset the counters, getting

parallel_register_count = 0
parallel_terminate_count = 0

and then ForgetBackgroundWorker() runs and increments the terminate_count,
breaking the logic?

And you're using (parallel_register_count > 0) to detect whether we're still
in the init phase or not. Hmmm.

Yes. But, as Robert suggested up in the thread, we should not use
(parallel_register_count = 0) as an alternative to define a bgworker
crash. Hence, I've added an argument named 'wasCrashed' in
ForgetBackgroundWorker to indicate a bgworker crash.

Sure, and I agree that having an explicit flag is the right solution.
I'm just trying to make sure I understand what's happening.

The comment says that the counters are allowed to overflow, i.e. after a
long uptime you might get these values

parallel_register_count = UINT_MAX + 1 = 0
parallel_terminate_count = UINT_MAX

which is fine, because C handles the overflow during subtraction and so

parallel_register_count - parallel_terminate_count = 1

But the assert is not doing subtraction, it's comparing the values directly:

Assert(parallel_register_count >= parallel_terminate_count);

and the (perfectly valid) values trivially violate this comparison.

Thanks for the explanation. So, we can't use the above assert
statement. Even the following assert statement will not be helpful:
Assert(parallel_register_count - parallel_terminate_count >= 0);
Because it'll fail to catch the scenario where parallel_register_count
has not overflowed but is still less than parallel_terminate_count. :(

Actually, that assert would work, because C does handle overflows on
uint values during subtraction just fine. That is,

(UINT_MAX+x) - UINT_MAX = x

Also, the comment about overflows before BackgroundWorkerArray claims
this is the case.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#17Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Tomas Vondra (#16)
Re: strange parallel query behavior after OOM crashes

On Wed, Apr 5, 2017 at 4:13 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

The comment says that the counters are allowed to overflow, i.e. after a
long uptime you might get these values

parallel_register_count = UINT_MAX + 1 = 0
parallel_terminate_count = UINT_MAX

which is fine, because C handles the overflow during subtraction and
so

parallel_register_count - parallel_terminate_count = 1

But the assert is not doing subtraction, it's comparing the values
directly:

Assert(parallel_register_count >= parallel_terminate_count);

and the (perfectly valid) values trivially violate this comparison.

Thanks for the explanation. So, we can't use the above assert
statement. Even the following assert statement will not be helpful:
Assert(parallel_register_count - parallel_terminate_count >= 0);
Because it'll fail to catch the scenario where parallel_register_count
has not overflowed but is still less than parallel_terminate_count. :(

Actually, that assert would work, because C does handle overflows on uint
values during subtraction just fine. That is,

(UINT_MAX+x) - UINT_MAX = x

Also, the comment about overflows before BackgroundWorkerArray claims this
is the case.

Agreed on the overflowed case. But, my concern is when an overflow has
not yet happened:

Suppose,
uint parallel_register_count = 1; /* Didn't overflow */
uint parallel_terminate_count = 2; /* Didn't overflow */

Assert(parallel_register_count - parallel_terminate_count >= 0);
We want the assert statement to fail here, but I think it won't, since
the result wraps around to UINT_MAX (the unsigned representation of
-1), which is greater than 0, no?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


#18Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kuntal Ghosh (#17)
Re: strange parallel query behavior after OOM crashes

On 04/05/2017 01:09 PM, Kuntal Ghosh wrote:

On Wed, Apr 5, 2017 at 4:13 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

The comment says that the counters are allowed to overflow, i.e. after a
long uptime you might get these values

parallel_register_count = UINT_MAX + 1 = 0
parallel_terminate_count = UINT_MAX

which is fine, because C handles the overflow during subtraction and
so

parallel_register_count - parallel_terminate_count = 1

But the assert is not doing subtraction, it's comparing the values
directly:

Assert(parallel_register_count >= parallel_terminate_count);

and the (perfectly valid) values trivially violate this comparison.

Thanks for the explanation. So, we can't use the above assert
statement. Even the following assert statement will not be helpful:
Assert(parallel_register_count - parallel_terminate_count >= 0);
Because it'll fail to catch the scenario where parallel_register_count
has not overflowed but is still less than parallel_terminate_count. :(

Actually, that assert would work, because C does handle overflows on uint
values during subtraction just fine. That is,

(UINT_MAX+x) - UINT_MAX = x

Also, the comment about overflows before BackgroundWorkerArray claims this
is the case.

Agreed on the overflowed case. But, my concern is when an overflow has
not yet happened:

Suppose,
uint parallel_register_count = 1; /* Didn't overflow */
uint parallel_terminate_count = 2; /* Didn't overflow */

Assert(parallel_register_count - parallel_terminate_count >= 0);
We want the assert statement to fail here, but I think it won't, since
the result wraps around to UINT_MAX (the unsigned representation of
-1), which is greater than 0, no?

I don't follow. How exactly do you get into this situation, assuming you
actually enforce the (register_count > terminate_count) invariant
consistently? In the modulo arithmetic of course.

But now that I'm thinking about it, the subtraction actually happens in
unsigned ints, so the result will be unsigned int too, i.e. always >=0.
Whether it makes sense depends on the invariant being true.

But I think this would work:

Assert(parallel_register_count - parallel_terminate_count <=
max_parallel_workers);

If the difference gets 'negative' and wraps around, it'll be close to
UINT_MAX.

But man, this unsigned int arithmetic makes my brain hurt ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#19Amit Kapila
amit.kapila16@gmail.com
In reply to: Kuntal Ghosh (#12)
Re: strange parallel query behavior after OOM crashes

On Wed, Apr 5, 2017 at 12:35 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

On Tue, Apr 4, 2017 at 11:22 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 04/04/2017 06:52 PM, Robert Haas wrote:

On Mon, Apr 3, 2017 at 6:08 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com>
wrote:

On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

2. the server restarts automatically, initialize
BackgroundWorkerData->parallel_register_count and
BackgroundWorkerData->parallel_terminate_count in the shared memory.
After that, it calls ForgetBackgroundWorker and it increments
parallel_terminate_count.

Hmm. So this seems like the root of the problem. Presumably those
things need to be reset AFTER forgetting any background workers from
before the crash.

IMHO, the fix would be not to increase the terminated parallel worker
count whenever ForgetBackgroundWorker is called due to a bgworker
crash. I've attached a patch for the same. PFA.

While I'm not opposed to that approach, I don't think this is a good
way to implement it. If you want to pass an explicit flag to
ForgetBackgroundWorker telling it whether or not it should perform
the increment, fine. But with what you've got here, you're
essentially relying on "spooky action at a distance". It would be
easy for future code changes to break this, not realizing that
somebody's got a hard dependency on 0 having a specific meaning.

I'm probably missing something, but I don't quite understand how these
values actually survive the crash. I mean, what I observed is OOM followed
by a restart, so shouldn't BackgroundWorkerShmemInit() simply reset the
values back to 0? Or do we call ForgetBackgroundWorker() after the crash for
some reason?

AFAICU, during crash recovery, we wait for all non-syslogger children
to exit, then reset shmem (call BackgroundWorkerShmemInit) and perform
StartupDataBase. While starting the startup process we check if any
bgworker is scheduled for a restart.

In general, your theory appears right, but can you check how it
behaves on a standby server, since the startup process behaves
differently during master and standby startup? On a master, it stops
after recovery, whereas on a standby it keeps running to receive WAL.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#20Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#11)
Re: strange parallel query behavior after OOM crashes

On Tue, Apr 4, 2017 at 1:52 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

In any case, the comment right before BackgroundWorkerArray says this:

* These counters can of course overflow, but it's not important here
* since the subtraction will still give the right number.

which means that this assert

+       Assert(BackgroundWorkerData->parallel_register_count >=
+               BackgroundWorkerData->parallel_terminate_count);

is outright broken, just like any other attempts to rely on simple
comparisons of these two counters, no?

Yeah, that's busted. Good catch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#21Robert Haas
robertmhaas@gmail.com
In reply to: Kuntal Ghosh (#15)
#22Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Robert Haas (#21)
#23Robert Haas
robertmhaas@gmail.com
In reply to: Kuntal Ghosh (#22)
#24Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kuntal Ghosh (#22)
#25Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#23)
#26Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Robert Haas (#23)
#27Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kuntal Ghosh (#26)
#28Neha Khatri
nehakhatri5@gmail.com
In reply to: Kuntal Ghosh (#13)
#29Robert Haas
robertmhaas@gmail.com
In reply to: Neha Khatri (#28)
#30Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Amit Kapila (#19)
#31Noah Misch
noah@leadboat.com
In reply to: Kuntal Ghosh (#30)
#32Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Robert Haas (#29)
#33Robert Haas
robertmhaas@gmail.com
In reply to: Kuntal Ghosh (#30)
#34Neha Khatri
nehakhatri5@gmail.com
In reply to: Robert Haas (#33)
#35Robert Haas
robertmhaas@gmail.com
In reply to: Neha Khatri (#34)
#36Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kuntal Ghosh (#32)
#37Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Robert Haas (#35)
#38Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#36)
#39Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#38)