explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

Started by hubert depesz lubaczewski · over 8 years ago · 43 messages · hackers

Hi,

Up until parallel execution, when a node in explain analyze showed
"loops=x" with x greater than 1, it meant that the "actual time" had to be
multiplied by loops to get the real time spent in the node.

For example, check step 13 in https://explain.depesz.com/s/gNBd

It shows time of 3ms, but loops of 1873, so the actual time is ~ 5600ms.
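
That arithmetic, as a minimal Python sketch (just the rule of thumb, not
explain.depesz.com's actual code):

    # Pre-parallel rule: EXPLAIN ANALYZE's "actual time" is an average per
    # loop, so the total time spent in a node is actual_time * loops.
    def node_total_time_ms(actual_time_ms, loops):
        return actual_time_ms * loops

    # Step 13 of https://explain.depesz.com/s/gNBd: 3.016 ms over 1873 loops.
    print(node_total_time_ms(3.016, 1873))  # ~5648.97 ms, i.e. ~5600ms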

But with parallel execution this no longer seems to be the case.

For example:
https://explain.depesz.com/s/LTMp
or
https://explain.depesz.com/s/QHRi

It looks like the actual time is really actual time, and loops is
"worker nodes + 1".

Is that really the case? Should I, for explain.depesz.com, when dealing
with partial* and parallel* nodes, use "loops=1" for calculation of
exclusive/inclusive time? always? some other nodes?

or am I missing something in here?

Best regards,

depesz

#2 Amit Kapila
amit.kapila16@gmail.com
In reply to: hubert depesz lubaczewski (#1)
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

On Fri, Nov 24, 2017 at 4:51 PM, hubert depesz lubaczewski
<depesz@depesz.com> wrote:

> Hi,
>
> Up until parallel execution, when a node in explain analyze showed
> "loops=x" with x greater than 1, it meant that the "actual time" had to be
> multiplied by loops to get the real time spent in the node.
>
> For example, check step 13 in https://explain.depesz.com/s/gNBd
>
> It shows time of 3ms, but loops of 1873, so the actual time is ~ 5600ms.
>
> But with parallel execution this no longer seems to be the case.
>
> For example:
> https://explain.depesz.com/s/LTMp
> or
> https://explain.depesz.com/s/QHRi
>
> It looks like the actual time is really actual time, and loops is
> "worker nodes + 1".
>
> Is that really the case?

I think so.

> Should I, for explain.depesz.com, when dealing
> with partial* and parallel* nodes, use "loops=1" for calculation of
> exclusive/inclusive time? always? some other nodes?

I am not sure what exactly inclusive or exclusive means, but for
parallel nodes, total stats are accumulated, so you are seeing loops as
'worker nodes + 1'. Now, since the workers presumably run in parallel, I
think the time actually spent is what is shown in the node, not
actual time * nloops.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#3 Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Kapila (#2)
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

On Sat, Nov 25, 2017 at 7:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

>> It looks like the actual time is really actual time, and loops is
>> "worker nodes + 1".
>>
>> Is that really the case?
>
> I think so.

To add to what Amit has explained: the + 1 is for the leader (the
backend running the whole query), which also runs the parallel plan if
it gets time between fetching rows from the workers.
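
A tiny sketch of that accounting (Python; illustrative only, not
PostgreSQL code):

    # Every process that executed the parallel plan contributes one "loop":
    # each launched worker, plus the leader if it found time to participate.
    def expected_loops(launched_workers, leader_participated=True):
        return launched_workers + (1 if leader_participated else 0)

    print(expected_loops(4))  # 5, i.e. "loops=5" under a 4-worker Gather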

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#4 hubert depesz lubaczewski
depesz@depesz.com
In reply to: Amit Kapila (#2)
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

On Sat, Nov 25, 2017 at 07:08:03AM +0530, Amit Kapila wrote:

>> For example, check step 13 in https://explain.depesz.com/s/gNBd
>>
>> It shows time of 3ms, but loops of 1873, so the actual time is ~ 5600ms.
>>
>> But with parallel execution this no longer seems to be the case.
>>
>> For example:
>> https://explain.depesz.com/s/LTMp
>> or
>> https://explain.depesz.com/s/QHRi
>> Should I, for explain.depesz.com, when dealing
>> with partial* and parallel* nodes, use "loops=1" for calculation of
>> exclusive/inclusive time? always? some other nodes?
>
> I am not sure what exactly inclusive or exclusive means, but for
> parallel nodes, total stats are accumulated, so you are seeing loops as
> 'worker nodes + 1'. Now, since the workers presumably run in parallel, I
> think the time actually spent is what is shown in the node, not
> actual time * nloops.

Please check the plans:
https://explain.depesz.com/s/gNBd (step 13)
and https://explain.depesz.com/s/LTMp (step 3)

Inclusive time is basically "loops * actual time", so for Index Scan,
which had 1873 loops and actual time of 3.002..3.016, we got 1873
* 3.016 = 5648.968ms.
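
In code terms, roughly this (a Python sketch of what the site computes;
the dict layout and helper names are made up):

    # Inclusive time: total time spent in a node including its children;
    # exclusive time: inclusive time minus the children's inclusive times.
    def inclusive_ms(node):
        return node["actual_time_ms"] * node["loops"]

    def exclusive_ms(node):
        return inclusive_ms(node) - sum(inclusive_ms(c) for c in node["children"])

    index_scan = {"actual_time_ms": 3.016, "loops": 1873, "children": []}
    print(inclusive_ms(index_scan))  # 5648.968 ms, as above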

In the case of parallel workers, it looks like the inclusive time is
basically the upper value from actual time.

The question now is: how can I tell which nodes should use "actual_time
* 1" and which "actual_time * loops"?

Anything "below" "Gather"?
Anything starting with "Partial"?
Anything starting with "Parallel"?
Anything with "Worker" in node "description" in explain?

depesz

#5 Amit Kapila
amit.kapila16@gmail.com
In reply to: hubert depesz lubaczewski (#4)
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

On Mon, Nov 27, 2017 at 2:45 PM, hubert depesz lubaczewski
<depesz@depesz.com> wrote:

> On Sat, Nov 25, 2017 at 07:08:03AM +0530, Amit Kapila wrote:
>
>>> For example, check step 13 in https://explain.depesz.com/s/gNBd
>>>
>>> It shows time of 3ms, but loops of 1873, so the actual time is ~ 5600ms.
>>>
>>> But with parallel execution this no longer seems to be the case.
>>>
>>> For example:
>>> https://explain.depesz.com/s/LTMp
>>> or
>>> https://explain.depesz.com/s/QHRi
>>> Should I, for explain.depesz.com, when dealing
>>> with partial* and parallel* nodes, use "loops=1" for calculation of
>>> exclusive/inclusive time? always? some other nodes?
>>
>> I am not sure what exactly inclusive or exclusive means, but for
>> parallel nodes, total stats are accumulated, so you are seeing loops as
>> 'worker nodes + 1'. Now, since the workers presumably run in parallel, I
>> think the time actually spent is what is shown in the node, not
>> actual time * nloops.
>
> Please check the plans:
> https://explain.depesz.com/s/gNBd (step 13)
> and https://explain.depesz.com/s/LTMp (step 3)
>
> Inclusive time is basically "loops * actual time", so for Index Scan,
> which had 1873 loops and actual time of 3.002..3.016, we got 1873
> * 3.016 = 5648.968ms.
>
> In the case of parallel workers, it looks like the inclusive time is
> basically the upper value from actual time.
>
> The question now is: how can I tell which nodes should use "actual_time
> * 1" and which "actual_time * loops"?
>
> Anything "below" "Gather"?

I think it is "actual_time * 1" for anything below Gather.
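
If that rule holds, the multiplier choice would look something like this
(a Python sketch of the heuristic as stated; the node-type test, the
function names, and the plan dicts are my own guesses, not a confirmed
list of node types):

    # Hypothetical rule: at or above Gather, multiply by loops as usual;
    # anywhere underneath a Gather, use the actual time as-is (multiplier 1).
    def inclusive_ms(node, under_gather=False):
        mult = 1 if under_gather else node["loops"]
        return node["actual_time_ms"] * mult

    def walk(node, under_gather=False):
        print(node["type"], inclusive_ms(node, under_gather))
        below = under_gather or node["type"].startswith("Gather")
        for child in node["children"]:
            walk(child, below)

    plan = {"type": "Gather", "actual_time_ms": 618.0, "loops": 2974,
            "children": [{"type": "Parallel Seq Scan",
                          "actual_time_ms": 609.0, "loops": 14870,
                          "children": []}]}
    walk(plan)  # Gather uses 618 * 2974; the scan below it uses 609 * 1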

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#6 hubert depesz lubaczewski
depesz@depesz.com
In reply to: Amit Kapila (#5)
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

On Mon, Nov 27, 2017 at 05:24:49PM +0530, Amit Kapila wrote:

> I think it is "actual_time * 1" for anything below Gather.

Well, I think I found another problem.

Please take a look at:
https://explain.depesz.com/s/Bs8c

Node 15 is Gather with loops = 2974 (because it was run by a nested loop).

There were 4 workers, so 5 processes working, but the loops on the parallel
seq scan is 17.698 million?!

The problem is that for explain.depesz.com I'm calculating how much time
pg actually spent doing a given thing.

So, if an operation has actual time of 1ms, but 100 loops, it took
100ms.

In the case of parallel operations, it looks like I can't realistically find
this information anywhere?

In the explain above, we see that the Nested Loop took, in total, around
2 million milliseconds (step 9).
Out of which, ~ 7000 ms was the hash join and its subnodes.
This leaves most of the time for the Gather step, which was called 2974
times with actual time reported as 618ms.
2974 * 618 is 1837932 milliseconds, which seems reasonable.

But then - how much of this time was spent in the Parallel Seq Scan in step
#16?

2974 * 609ms? Why is loops there 17 million?

Best regards,

depesz

#7 Amit Kapila
amit.kapila16@gmail.com
In reply to: hubert depesz lubaczewski (#6)
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

On Mon, Nov 27, 2017 at 11:26 PM, hubert depesz lubaczewski
<depesz@depesz.com> wrote:

> On Mon, Nov 27, 2017 at 05:24:49PM +0530, Amit Kapila wrote:
>
>> I think it is "actual_time * 1" for anything below Gather.
>
> Well, I think I found another problem.
>
> Please take a look at:
> https://explain.depesz.com/s/Bs8c
>
> Node 15 is Gather with loops = 2974 (because it was run by a nested loop).
>
> There were 4 workers, so 5 processes working, but the loops on the parallel
> seq scan is 17.698 million?!

The number of loops displayed on the parallel seq scan seems to be wrong; see below.

> The problem is that for explain.depesz.com I'm calculating how much time
> pg actually spent doing a given thing.
>
> So, if an operation has actual time of 1ms, but 100 loops, it took
> 100ms.
>
> In the case of parallel operations, it looks like I can't realistically find
> this information anywhere?
>
> In the explain above, we see that the Nested Loop took, in total, around
> 2 million milliseconds (step 9).
> Out of which, ~ 7000 ms was the hash join and its subnodes.
> This leaves most of the time for the Gather step, which was called 2974
> times with actual time reported as 618ms.
> 2974 * 618 is 1837932 milliseconds, which seems reasonable.
>
> But then - how much of this time was spent in the Parallel Seq Scan in step
> #16?
>
> 2974 * 609ms?

Yeah.

> Why is loops there 17 million?

That is wrong and I think you have hit a bug. It should be 2974 * 5 =
14870, as you have seen in other cases. The problem is that during a
rescan we generally reinitialize the required state, but we forgot to
reinitialize the instrumentation-related memory that is used in the
accumulation of stats. Changing that fixes part of the problem: you
won't see wrong values at the Parallel node anymore. However, we also
need to ensure that the per-worker details get accumulated across
rescans. The attached patch should fix the problem you are seeing. I
think this needs some more analysis and testing to see if everything
works in the desired way.
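
A toy model of the accumulation going wrong (Python standing in for the
executor's bookkeeping; the numbers come from this thread, but the
mechanism is simplified and the function is made up):

    # Each rescan, the participants add their loop counts into shared memory;
    # the leader then folds the shared buffer into the node's running totals.
    # If shared memory is not reset between rescans, stale counts are re-added.
    def total_loops(rescans, participants, reset_between_rescans):
        shared = 0       # per-node loop count living in shared memory
        node_total = 0   # what EXPLAIN ANALYZE would end up showing
        for _ in range(rescans):
            if reset_between_rescans:
                shared = 0
            shared += participants   # this rescan's real loops
            node_total += shared     # leader accumulates the shared stats
        return node_total

    print(total_loops(2974, 5, True))   # 14870 = 2974 * 5, the expected value
    print(total_loops(2974, 5, False))  # ~22.1 million: the same kind of blow-up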

Is it possible for you to test the attached patch and see if you are
still seeing any unexpected values?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_accum_instr_parallel_workers_v1.patch (application/octet-stream; +30 -7)
#8 hubert depesz lubaczewski
depesz@depesz.com
In reply to: Amit Kapila (#7)
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

On Tue, Nov 28, 2017 at 12:53:41PM +0530, Amit Kapila wrote:

> Is it possible for you to test the attached patch and see if you are
> still seeing any unexpected values?

Well, not really. The explains I had were posted by people on
explain.depesz.com, so I don't have the original queries nor their datasets.

Best regards,

depesz

#9 Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#5)
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

On Mon, Nov 27, 2017 at 6:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

>> Anything "below" "Gather"?
>
> I think it is "actual_time * 1" for anything below Gather.

The actual time amounts below Gather show total elapsed time divided
by loop count, just as they do anywhere else in the plan. The loop
count tracks the number of times the plan was executed, just as it
does anywhere else in the plan. So for example if there are 5
participants which each execute the node once, and the times spent are
5 s, 60 s, 60 s, 60 s, and 60 s, you'll get actual time = 49 s, loops
= 5.

If you want to know the total amount of time spent under the node,
then you have to multiply the actual time (49 s in this example) by
the loop count (5 in this example) just as you would for any other
plan node. However, you have to keep in mind that, for a parallel
query, the total time spent under the node is not the same as the
elapsed time. In this example, if all 5 workers started at the same
time and ran the node continuously without a break, the *elapsed* time
until they all finished would be 60 s, not 49 s, a value that EXPLAIN
will report nowhere. But they can also start and stop executing under
that node repeatedly and they need not all start at the same time,
making the concept of elapsed time a bit unclear -- earliest start to
latest finish would be one way, but that will be misleading (perhaps)
if they spend 5 seconds apiece but start staggered at 4-second
intervals.
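
The example, spelled out in Python for clarity (nothing beyond the
arithmetic above):

    # Five participants, each executing the node once.
    times_s = [5, 60, 60, 60, 60]

    loops = len(times_s)                  # 5
    actual_time_s = sum(times_s) / loops  # 49.0 -- what EXPLAIN prints
    total_time_s = actual_time_s * loops  # 245.0 -- total work under the node
    # The elapsed wall-clock time (60 s, if everyone started together)
    # appears nowhere in the output.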

It's really hard to understand what's actually going on with a
parallel query unless you look at each worker individually, and even
then sometimes it's not as clear as it could be. I'm not sure what to
do about that. Elsewhere, trying to show the leader's information
separately when VERBOSE is used has been discussed, and I think that's
a good idea, but it may not be enough.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10 Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#7)
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

On Tue, Nov 28, 2017 at 2:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

> That is wrong and I think you have hit a bug. It should be 2974 * 5 =
> 14870, as you have seen in other cases. The problem is that during a
> rescan we generally reinitialize the required state, but we forgot to
> reinitialize the instrumentation-related memory that is used in the
> accumulation of stats. Changing that fixes part of the problem: you
> won't see wrong values at the Parallel node anymore. However, we also
> need to ensure that the per-worker details get accumulated across
> rescans. The attached patch should fix the problem you are seeing. I
> think this needs some more analysis and testing to see if everything
> works in the desired way.
>
> Is it possible for you to test the attached patch and see if you are
> still seeing any unexpected values?

FWIW, this looks sensible to me. Not sure if there's any good way to
write a regression test for it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11 Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#10)
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

On Tue, Nov 28, 2017 at 9:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> On Tue, Nov 28, 2017 at 2:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>> That is wrong and I think you have hit a bug. It should be 2974 * 5 =
>> 14870, as you have seen in other cases. The problem is that during a
>> rescan we generally reinitialize the required state, but we forgot to
>> reinitialize the instrumentation-related memory that is used in the
>> accumulation of stats. Changing that fixes part of the problem: you
>> won't see wrong values at the Parallel node anymore. However, we also
>> need to ensure that the per-worker details get accumulated across
>> rescans. The attached patch should fix the problem you are seeing. I
>> think this needs some more analysis and testing to see if everything
>> works in the desired way.
>>
>> Is it possible for you to test the attached patch and see if you are
>> still seeing any unexpected values?
>
> FWIW, this looks sensible to me. Not sure if there's any good way to
> write a regression test for it.

I think so, but not 100% sure. I will give it a try and report back.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#12 Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#9)
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

On Tue, Nov 28, 2017 at 9:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> On Mon, Nov 27, 2017 at 6:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>>> Anything "below" "Gather"?
>>
>> I think it is "actual_time * 1" for anything below Gather.
>
> The actual time amounts below Gather show total elapsed time divided
> by loop count, just as they do anywhere else in the plan. The loop
> count tracks the number of times the plan was executed, just as it
> does anywhere else in the plan. So for example if there are 5
> participants which each execute the node once, and the times spent are
> 5 s, 60 s, 60 s, 60 s, and 60 s, you'll get actual time = 49 s, loops
> = 5.
>
> If you want to know the total amount of time spent under the node,
> then you have to multiply the actual time (49 s in this example) by
> the loop count (5 in this example) just as you would for any other
> plan node. However, you have to keep in mind that, for a parallel
> query, the total time spent under the node is not the same as the
> elapsed time. In this example, if all 5 workers started at the same
> time and ran the node continuously without a break, the *elapsed* time
> until they all finished would be 60 s, not 49 s, a value that EXPLAIN
> will report nowhere. But they can also start and stop executing under
> that node repeatedly and they need not all start at the same time,
> making the concept of elapsed time a bit unclear -- earliest start to
> latest finish would be one way, but that will be misleading (perhaps)
> if they spend 5 seconds apiece but start staggered at 4-second
> intervals.
>
> It's really hard to understand what's actually going on with a
> parallel query unless you look at each worker individually, and even
> then sometimes it's not as clear as it could be.

I think the stats at the Gather node itself give a somewhat accurate picture.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#13 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#11)
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

On Wed, Nov 29, 2017 at 2:04 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

> On Tue, Nov 28, 2017 at 9:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> On Tue, Nov 28, 2017 at 2:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>>> That is wrong and I think you have hit a bug. It should be 2974 * 5 =
>>> 14870, as you have seen in other cases. The problem is that during a
>>> rescan we generally reinitialize the required state, but we forgot to
>>> reinitialize the instrumentation-related memory that is used in the
>>> accumulation of stats. Changing that fixes part of the problem: you
>>> won't see wrong values at the Parallel node anymore. However, we also
>>> need to ensure that the per-worker details get accumulated across
>>> rescans. The attached patch should fix the problem you are seeing. I
>>> think this needs some more analysis and testing to see if everything
>>> works in the desired way.
>>>
>>> Is it possible for you to test the attached patch and see if you are
>>> still seeing any unexpected values?
>>
>> FWIW, this looks sensible to me. Not sure if there's any good way to
>> write a regression test for it.
>
> I think so, but not 100% sure. I will give it a try and report back.

The attached patch contains a regression test as well. Note that I have
carefully disabled all variable stats by using (analyze, timing off,
summary off, costs off) and then selected a parallel sequential scan on
the right side of the join, so that nloops and rows are the only
remaining variable stats, and those should stay constant.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_accum_instr_parallel_workers_v2.patch (application/octet-stream; +56 -7)
#14 Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#13)
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

On Sat, Dec 2, 2017 at 8:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

> The attached patch contains a regression test as well. Note that I have
> carefully disabled all variable stats by using (analyze, timing off,
> summary off, costs off) and then selected a parallel sequential scan on
> the right side of the join, so that nloops and rows are the only
> remaining variable stats, and those should stay constant.

The regression test contains a whitespace error about which git diff
--check complains.

Also, looking at this again, shouldn't the reinitialization of the
instrumentation arrays happen in ExecParallelReinitialize rather than
ExecParallelFinish, so that we don't spend time doing it unless the
Gather is actually re-executed?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15 Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#14)
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

On Mon, Dec 4, 2017 at 11:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> On Sat, Dec 2, 2017 at 8:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>> The attached patch contains a regression test as well. Note that I have
>> carefully disabled all variable stats by using (analyze, timing off,
>> summary off, costs off) and then selected a parallel sequential scan on
>> the right side of the join, so that nloops and rows are the only
>> remaining variable stats, and those should stay constant.
>
> The regression test contains a whitespace error about which git diff
> --check complains.

Oops, a silly mistake on my side.

> Also, looking at this again, shouldn't the reinitialization of the
> instrumentation arrays happen in ExecParallelReinitialize rather than
> ExecParallelFinish, so that we don't spend time doing it unless the
> Gather is actually re-executed?

Yeah, that sounds better, so I modified the patch accordingly.

I have one more observation in a somewhat related area. From the
code, it looks like we might have a problem with displaying sort
info for workers on rescans. I think the problem with the sort info
is that ExecSortRetrieveInstrumentation replaces shared_info with a
copy in local memory, after which it can no longer access the
values in shared memory changed by workers during rescans. We might
be able to fix it by having some local_info, analogous to shared_info,
in the sort node. But the main problem is how we accumulate stats
for workers across rescans. The type of sort method can change
across rescans. We might be able to accumulate the size of memory
used, though I am not sure if that is right. Though this appears to
be somewhat related to the problem being discussed in this thread, I
think it can be dealt with separately if we want to fix it.
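
A toy model of that pointer problem (Python in place of the C structures;
the class and variable names are made up, and shared_info normally points
into the DSM segment):

    # Shared-memory stats, updated by workers on every (re)scan.
    dsm_sort_stats = {"sort_method": "external merge", "space_kb": 1024}

    class SortState:
        def __init__(self, shared):
            self.shared_info = shared  # points into the DSM

        def retrieve_instrumentation(self):
            # The problematic step: the DSM pointer is replaced with a
            # local-memory copy, so later shared-memory updates are lost.
            self.shared_info = dict(self.shared_info)

    node = SortState(dsm_sort_stats)
    node.retrieve_instrumentation()
    dsm_sort_stats["space_kb"] = 2048      # a worker updates stats on rescan
    print(node.shared_info["space_kb"])    # still 1024: update not visible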

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_accum_instr_parallel_workers_v3.patch (application/octet-stream; +56 -6)
#16 Thomas Munro
thomas.munro@gmail.com
In reply to: Amit Kapila (#15)
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

On Tue, Dec 5, 2017 at 6:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

> I have one more observation in a somewhat related area. From the
> code, it looks like we might have a problem with displaying sort
> info for workers on rescans. I think the problem with the sort info
> is that ExecSortRetrieveInstrumentation replaces shared_info with a
> copy in local memory, after which it can no longer access the
> values in shared memory changed by workers during rescans. We might
> be able to fix it by having some local_info, analogous to shared_info,
> in the sort node. But the main problem is how we accumulate stats
> for workers across rescans. The type of sort method can change
> across rescans. We might be able to accumulate the size of memory
> used, though I am not sure if that is right. Though this appears to
> be somewhat related to the problem being discussed in this thread, I
> think it can be dealt with separately if we want to fix it.

Yeah, that's broken. ExecSortRetrieveInstrumentation() is run for
each loop, and after the first loop we've lost track of the pointer
into shared memory because we replaced it with a palloc'd copy. We
could do what you said, or we could reinstate the pointer into the DSM
in ExecSortReInitializeDSM() by looking it up in the TOC.

The reason I've popped up out of nowhere on this thread to say this is
that I just realised that my nearby patch that adds support for Hash
instrumentation has the same bug, because I copied the way
ExecSortRetrieveInstrumentation() works to make
ExecHashRetrieveInstrumentation(). I initially assumed this would run
just once after the final loop in a rescan situation, but on testing I
see that it runs repeatedly, and of course loses instrumentation
during rescans. I'll go and fix that. We should choose one approach
for both cases. Do you prefer a separate variable for the local copy,
or reinstating the pointer into the DSM during
ExecXXXReInitializeDSM()?

As for how to aggregate the information, isn't it reasonable to show
data from the last loop on the basis that it's representative?
Summing wouldn't make too much sense, because you didn't use that much
memory all at once.

--
Thomas Munro
http://www.enterprisedb.com

#17 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#16)
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

On Tue, Dec 5, 2017 at 8:49 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

> On Tue, Dec 5, 2017 at 6:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>> I have one more observation in a somewhat related area. From the
>> code, it looks like we might have a problem with displaying sort
>> info for workers on rescans. I think the problem with the sort info
>> is that ExecSortRetrieveInstrumentation replaces shared_info with a
>> copy in local memory, after which it can no longer access the
>> values in shared memory changed by workers during rescans. We might
>> be able to fix it by having some local_info, analogous to shared_info,
>> in the sort node. But the main problem is how we accumulate stats
>> for workers across rescans. The type of sort method can change
>> across rescans. We might be able to accumulate the size of memory
>> used, though I am not sure if that is right. Though this appears to
>> be somewhat related to the problem being discussed in this thread, I
>> think it can be dealt with separately if we want to fix it.
>
> Yeah, that's broken. ExecSortRetrieveInstrumentation() is run for
> each loop, and after the first loop we've lost track of the pointer
> into shared memory because we replaced it with a palloc'd copy. We
> could do what you said, or we could reinstate the pointer into the DSM
> in ExecSortReInitializeDSM() by looking it up in the TOC.

Or would it be an option to change when
ExecXXXRetrieveInstrumentation() is called, so it is run only once?

--
Thomas Munro
http://www.enterprisedb.com

#18 Amit Kapila
amit.kapila16@gmail.com
In reply to: Thomas Munro (#17)
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

On Tue, Dec 5, 2017 at 1:29 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

> On Tue, Dec 5, 2017 at 8:49 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>
>> On Tue, Dec 5, 2017 at 6:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>>> I have one more observation in a somewhat related area. From the
>>> code, it looks like we might have a problem with displaying sort
>>> info for workers on rescans. I think the problem with the sort info
>>> is that ExecSortRetrieveInstrumentation replaces shared_info with a
>>> copy in local memory, after which it can no longer access the
>>> values in shared memory changed by workers during rescans. We might
>>> be able to fix it by having some local_info, analogous to shared_info,
>>> in the sort node. But the main problem is how we accumulate stats
>>> for workers across rescans. The type of sort method can change
>>> across rescans. We might be able to accumulate the size of memory
>>> used, though I am not sure if that is right. Though this appears to
>>> be somewhat related to the problem being discussed in this thread, I
>>> think it can be dealt with separately if we want to fix it.
>>
>> Yeah, that's broken. ExecSortRetrieveInstrumentation() is run for
>> each loop, and after the first loop we've lost track of the pointer
>> into shared memory because we replaced it with a palloc'd copy. We
>> could do what you said, or we could reinstate the pointer into the DSM
>> in ExecSortReInitializeDSM() by looking it up in the TOC.
>
> Or would it be an option to change when
> ExecXXXRetrieveInstrumentation() is called, so it is run only once?

To me, that doesn't sound like a bad option. I think if we do so, then
we don't even need to reinitialize the shared sort stats. I think
something like the attached should work if we want to go this route. We
can add a regression test if we think this is a good idea. Having said
that, one problem I see with doing things this way is that, in general,
we will display the accumulated stats of each worker, but for sort or
some other special nodes (like hash), we will show the information of
only the last loop. I am not sure if that is a matter of concern, but
if we want to do it this way, then we should probably explain this in
the documentation as well.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_accum_instr_parallel_workers_v4.patch (application/octet-stream; +4 -28)
#19 Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#16)
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

On Tue, Dec 5, 2017 at 2:49 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

> As for how to aggregate the information, isn't it reasonable to show
> data from the last loop on the basis that it's representative?
> Summing wouldn't make too much sense, because you didn't use that much
> memory all at once.

Sorts can be rescanned even without parallel query, so I guess we
should try to make the parallel case kinda like the non-parallel case.
If I'm not wrong, that will just use the stats from the most recent
execution (i.e. the last loop) -- see show_sort_info() in explain.c.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20 Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#15)
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

On Tue, Dec 5, 2017 at 12:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

> Yeah, that sounds better, so I modified the patch accordingly.

I committed this to master and REL_10_STABLE, but it conflicts all
over the place on 9.6.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21 Thomas Munro
thomas.munro@gmail.com
In reply to: Amit Kapila (#18)
#22 Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#19)
#23 Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#20)
#24 Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#22)
#25 Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#24)
#26 Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#24)
#27 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#26)
#28 Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#27)
#29 Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#28)
#30 Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#29)
#31 Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#27)
#32 Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#31)
#33 Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#32)
#34 Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#33)
#35 Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#34)
#36 Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#35)
#37 Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#36)
#38 Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#21)
#39 Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#38)
#40 Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#37)
#41 Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#37)
#42 Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#41)
#43 Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#42)