explain analyze output with parallel workers - question about meaning of information for explain.depesz.com
Hi,
Before parallel execution, when a node in explain analyze showed
"loops=x" with x greater than 1, it meant that the "actual time" had to be
multiplied by loops to get the real time spent in that node.
For example, check step 13 in https://explain.depesz.com/s/gNBd
It shows a time of 3ms, but 1873 loops, so the total time is ~ 5600ms.
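(To illustrate the old rule with a minimal sketch - this is not the actual explain.depesz.com code, and the helper name is made up:)
# The pre-parallel rule: total time spent in a node is the per-loop
# "actual time" multiplied by the loop count.
def node_total_time_ms(actual_time_upper_ms, loops):
    return actual_time_upper_ms * loops

# Step 13 of https://explain.depesz.com/s/gNBd:
print(node_total_time_ms(3.016, 1873))  # ~5648.97 ms, i.e. roughly 5600 ms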
But with parallel execution this no longer seems to be the case.
For example:
https://explain.depesz.com/s/LTMp
or
https://explain.depesz.com/s/QHRi
It looks like the actual time is really the actual time, and loops is
"worker nodes + 1".
Is that really the case? Should I, for explain.depesz.com, when dealing
with Partial* and Parallel* nodes, use "loops=1" when calculating
exclusive/inclusive time? Always? For some other nodes too?
Or am I missing something here?
Best regards,
depesz
On Fri, Nov 24, 2017 at 4:51 PM, hubert depesz lubaczewski
<depesz@depesz.com> wrote:
Hi,
up to parallel executions, when we had node in explain analyze showing
"loops=x" with x more than 1, it meant that the "actual time" had to be
multiplied by loops to get real time spent in a node.
For example, check step 13 in https://explain.depesz.com/s/gNBd
It shows time of 3ms, but loops of 1873, so the actual time is ~ 5600ms.
But with parallel execution it seems to be no longer the case.
For example:
https://explain.depesz.com/s/LTMp
or
https://explain.depesz.com/s/QHRi
It looks that the actual time is really actual time, and loops is
"worker nodes + 1". Is that really the case?
I think so.
Should I, for explain.depesz.com, when dealing
with partial* and parallel* nodes, use "loops=1" for calculation of
exclusive/inclusive time? always? some other nodes?
I am not sure what exactly inclusive or exclusive means, but for
parallel nodes, total stats are accumulated, so you are seeing loops as
'worker nodes + 1'. Now, since the workers presumably run in parallel, I
think the actual time will be what is shown in the node, not
actual time * nloops.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Nov 25, 2017 at 7:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
It looks that the actual time is really actual time, and loops is
"worker nodes + 1".Is that really the case?
I think so.
To add to what Amit has explained: the +1 is for the leader (the
backend running the whole query), which also runs the parallel plan if
it gets time between fetching rows from the workers.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Sat, Nov 25, 2017 at 07:08:03AM +0530, Amit Kapila wrote:
For example, check step 13 in https://explain.depesz.com/s/gNBd
It shows time of 3ms, but loops of 1873, so the actual time is ~ 5600ms.
But with parallel execution it seems to be no longer the case.
For example:
https://explain.depesz.com/s/LTMp
or
https://explain.depesz.com/s/QHRi
Should I, for explain.depesz.com, when dealing
with partial* and parallel* nodes, use "loops=1" for calculation of
exclusive/inclusive time? always? some other nodes?
I am not sure what exactly inclusive or exclusive means, but for
parallel nodes, total stats are accumulated, so you are seeing loops as
'worker nodes + 1'. Now, since the workers presumably run in parallel, I
think the actual time will be what is shown in the node, not
actual time * nloops.
Please check the plans:
https://explain.depesz.com/s/gNBd (step 13)
and https://explain.depesz.com/s/LTMp (step 3)
Inclusive time is basically "loops * actual time", so for the Index Scan,
which had 1873 loops and an actual time of 3.002..3.016, we got 1873
* 3.016 = 5648.968ms.
In the case of parallel workers it looks like the inclusive time is
basically just the upper value from the actual time.
The question now is: how can I tell which nodes should use "actual_time
* 1" and which "actual_time * loops"?
Anything "below" "Gather"?
Anything starting with "Partial"?
Anything starting with "Parallel"?
Anything with "Worker" in the node "description" in the explain output?
depesz
On Mon, Nov 27, 2017 at 2:45 PM, hubert depesz lubaczewski
<depesz@depesz.com> wrote:
On Sat, Nov 25, 2017 at 07:08:03AM +0530, Amit Kapila wrote:
For example, check step 13 in https://explain.depesz.com/s/gNBd
It shows time of 3ms, but loops of 1873, so the actual time is ~ 5600ms.
But with parallel execution it seems to be no longer the case.
For example:
https://explain.depesz.com/s/LTMp
or
https://explain.depesz.com/s/QHRi
Should I, for explain.depesz.com, when dealing
with partial* and parallel* nodes, use "loops=1" for calculation of
exclusive/inclusive time? always? some other nodes?
I am not sure what exactly inclusive or exclusive means, but for
parallel nodes, total stats are accumulated, so you are seeing loops as
'worker nodes + 1'. Now, since the workers presumably run in parallel, I
think the actual time will be what is shown in the node, not
actual time * nloops.
Please check the plans:
https://explain.depesz.com/s/gNBd (step 13)
and https://explain.depesz.com/s/LTMp (step 3)
Inclusive time is basically "loops * actual time", so for Index Scan,
which had 1873 loops and actual time of 3.002..3.016, we got 1873
* 3.016 = 5648.968ms.
In case of parallel workers it looks like the inclusive time is
basically the upper value from actual time.
The question now is: how can I tell which nodes should use "actual_time
* 1" and which "actual_time * loops" time?
Anything "below" "Gather"?
I think it is "actual_time * 1" for anything below Gather.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Nov 27, 2017 at 05:24:49PM +0530, Amit Kapila wrote:
I think it is "actual_time * 1" for anything below Gather.
Well, I think I found another problem.
Please take a look at:
https://explain.depesz.com/s/Bs8c
Node 15 is a Gather with loops = 2974 (because it was run by a nested loop).
There were 4 workers, so 5 processes working, but the loop count on the
Parallel Seq Scan is 17.698 million?!
The problem is that for explain.depesz.com I'm calculating how much time
Pg actually spent doing a given thing.
So, if an operation has an actual time of 1ms, but 100 loops, it took
100ms.
In the case of parallel operations it looks like I can't realistically find
this information anywhere?
In the explain above, we see that the Nested Loop took, in total, around
2 million milliseconds (step 9).
Out of which, ~ 7000 ms was the Hash Join and its subnodes.
This leaves most of the time for the Gather step, which was called 2974
times with an actual time reported as 618ms.
2974 * 618 is 1837932 milliseconds, which seems reasonable.
But then - how much of this time was spent in the Parallel Seq Scan in step
#16?
2974 * 609ms? Why is loops there 17 million?
Best regards,
depesz
On Mon, Nov 27, 2017 at 11:26 PM, hubert depesz lubaczewski
<depesz@depesz.com> wrote:
On Mon, Nov 27, 2017 at 05:24:49PM +0530, Amit Kapila wrote:
I think it is "actual_time * 1" for anything below Gather.
Well, I think I found another problem.
Please take a look at:
https://explain.depesz.com/s/Bs8c
Node 15 is a Gather with loops = 2974 (because it was run by a nested loop).
There were 4 workers, so 5 processes working, but the loop count on the
Parallel Seq Scan is 17.698 million?!
The number of loops displayed on parallel seq scan seems to be wrong, see below.
The problem is that for explain.depesz.com I'm calculating how much time
pg actually spent doing a given thing.
So, if an operation has an actual time of 1ms, but 100 loops, it took
100ms.
In case of parallel operations it looks like I can't realistically find
the information anywhere?
In the explain above, we see that Nested loop took, in total, around
2 million milliseconds (step 9).
Out of which, ~ 7000 ms was the hash join and its subnodes.
This leaves most of the time for the Gather step, which was called 2974
times with actual time reported as 618ms.
2974 * 618 is 1837932 milliseconds, which seems reasonable.
But then - how much of this time was spent in Parallel Seq Scan in step
#16? 2974 * 609ms?
Yeah.
Why is loops there 17 million?
That is wrong and I think you have hit a bug. It should be 2974 * 5 =
14870, as you have seen in other cases. The problem is that during a
rescan, we generally reinitialize the required state, but we forgot to
reinitialize the instrumentation-related memory which is used in the
accumulation of stats. Fixing that addresses part of the problem, namely
that you won't see wrong values at the Parallel node.
However, we also need to ensure that the per-worker details also get
accumulated across rescans. The attached patch should fix the problem you
are seeing. I think this needs some more analysis and testing to see
if everything works in the desired way.
Is it possible for you to test the attached patch and see if you are
still seeing any unexpected values?
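Just to sanity-check the expected value (plain arithmetic for illustration, not code from the patch):
# A node under Gather that is rescanned should be counted once per
# participant (workers + leader) per rescan of the Gather.
gather_loops = 2974      # the Gather was driven by a nested loop
participants = 4 + 1     # 4 workers plus the leader
print(gather_loops * participants)  # 14870, not ~17.7 million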
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
fix_accum_instr_parallel_workers_v1.patch
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 53c5254..6a6bf20 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -919,14 +919,25 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
* regular instrumentation information, which is the per-query context.
* Switch into per-query memory context.
*/
- oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
- ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
- planstate->worker_instrument =
- palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
- MemoryContextSwitchTo(oldcontext);
+ if (!planstate->worker_instrument)
+ {
+ oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
+ ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
+ planstate->worker_instrument =
+ palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
+ MemoryContextSwitchTo(oldcontext);
+
+ for (n = 0; n < instrumentation->num_workers; ++n)
+ InstrInit(&planstate->worker_instrument->instrument[n],
+ planstate->state->es_instrument);
+ }
planstate->worker_instrument->num_workers = instrumentation->num_workers;
- memcpy(&planstate->worker_instrument->instrument, instrument, ibytes);
+
+ /* Accumulate the per-worker detail. */
+ for (n = 0; n < instrumentation->num_workers; ++n)
+ InstrAggNode(&planstate->worker_instrument->instrument[n],
+ &instrument[n]);
/*
* Perform any node-type-specific work that needs to be done. Currently,
@@ -989,8 +1000,20 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
/* Finally, accumulate instrumentation, if any. */
if (pei->instrumentation)
+ {
+ Instrumentation *instrument;
+ SharedExecutorInstrumentation *sh_instr;
+
+ sh_instr = pei->instrumentation;
+
ExecParallelRetrieveInstrumentation(pei->planstate,
- pei->instrumentation);
+ sh_instr);
+
+ /* Clear the instrumentation space for next time. */
+ instrument = GetInstrumentationArray(sh_instr);
+ for (i = 0; i < sh_instr->num_workers * sh_instr->num_plan_nodes; ++i)
+ InstrInit(&instrument[i], pei->planstate->state->es_instrument);
+ }
pei->finished = true;
}
On Tue, Nov 28, 2017 at 12:53:41PM +0530, Amit Kapila wrote:
Is it possible for you to test the attached patch and see if you are
still seeing any unexpected values?
Well, not really. The explains I had were posted by people on
explain.depesz.com, so I don't have the original queries or their datasets.
Best regards,
depesz
On Mon, Nov 27, 2017 at 6:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Anything "below" "Gather"?
I think it is "actual_time * 1" for anything below Gather.
The actual time amounts below Gather show total elapsed time divided
by loop count, just as they do anywhere else in the plan. The loop
count tracks the number of times the plan was executed, just as it
does anywhere else in the plan. So for example if there are 5
participants which each execute the node once, and the times spent are
5 s, 60 s, 60 s, 60 s, and 60 s, you'll get actual time = 49 s, loops
= 5.
If you want to know the total amount of time spent under the node,
then you have to multiply the actual time (49 s in this example) by
the loop count (5 in this example) just as you would for any other
plan node. However, you have to keep in mind that, for a parallel
query, the total time spent under the node is not the same as the
elapsed time. In this example, if all 5 workers started at the same
time and ran the node continuously without a break, the *elapsed* time
until they all finished would be 60 s, not 49 s, a value that EXPLAIN
will report nowhere. But they can also start and stop executing under
that node repeatedly and they need not all start at the same time,
making the concept of elapsed time a bit unclear -- earliest start to
latest finish would be one way, but that will be misleading (perhaps)
if they spend 5 seconds apiece but start staggered at 4-second
intervals.
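Putting the example into numbers (a toy calculation, not anything EXPLAIN itself computes):
# Per-participant time spent in the node, from the example above.
times_s = [5, 60, 60, 60, 60]
loops = len(times_s)                  # 5 participants, one execution each
actual_time_s = sum(times_s) / loops  # 49.0 -- what EXPLAIN reports
total_time_s = actual_time_s * loops  # 245.0 -- total work done under the node
elapsed_s = max(times_s)              # 60.0 -- wall clock if all start together;
                                      #         EXPLAIN reports this nowhere
print(actual_time_s, total_time_s, elapsed_s)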
It's really hard to understand what's actually going on with a
parallel query unless you look at each worker individually, and even
then sometimes it's not as clear as it could be. I'm not sure what to
do about that. Elsewhere, trying to show the leader's information
separately when VERBOSE is used has been discussed, and I think that's
a good idea, but it may not be enough.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Nov 28, 2017 at 2:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
That is wrong and I think you have hit a bug. It should be 2974 * 5 =
14870 as you have seen in other cases. The problem is that during
rescan, we generally reinitialize the required state, but we forgot to
reinitialize the instrumentation related memory which is used in the
accumulation of stats, so changing that would fix some part of this
problem which is that at Parallel node, you won't see wrong values.
However, we also need to ensure that the per-worker details also get
accumulated across rescans. Attached patch should fix the problem you
are seeing. I think this needs some more analysis and testing to see
if everything works in the desired way.
Is it possible for you to test the attached patch and see if you are
still seeing any unexpected values?
FWIW, this looks sensible to me. Not sure if there's any good way to
write a regression test for it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Nov 28, 2017 at 9:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Nov 28, 2017 at 2:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
That is wrong and I think you have hit a bug. It should be 2974 * 5 =
14870 as you have seen in other cases. The problem is that during
rescan, we generally reinitialize the required state, but we forgot to
reinitialize the instrumentation related memory which is used in the
accumulation of stats, so changing that would fix some part of this
problem which is that at Parallel node, you won't see wrong values.
However, we also need to ensure that the per-worker details also get
accumulated across rescans. Attached patch should fix the problem you
are seeing. I think this needs some more analysis and testing to see
if everything works in the desired way.
Is it possible for you to test the attached patch and see if you are
still seeing any unexpected values?
FWIW, this looks sensible to me. Not sure if there's any good way to
write a regression test for it.
I think so, but not 100% sure. I will give it a try and report back.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Nov 28, 2017 at 9:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Nov 27, 2017 at 6:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Anything "below" "Gather"?
I think it is "actual_time * 1" for anything below Gather.
The actual time amounts below Gather show total elapsed time divided
by loop count, just as they do anywhere else in the plan. The loop
count tracks the number of times the plan was executed, just as it
does anywhere else in the plan. So for example if there are 5
participants which each execute the node once, and the times spent are
5 s, 60 s, 60 s, 60 s, and 60 s, you'll get actual time = 49 s, loops
= 5.
If you want to know the total amount of time spent under the node,
then you have to multiply the actual time (49 s in this example) by
the loop count (5 in this example) just as you would for any other
plan node. However, you have to keep in mind that, for a parallel
query, the total time spent under the node is not the same as the
elapsed time. In this example, if all 5 workers started at the same
time and ran the node continuously without a break, the *elapsed* time
until they all finished would be 60 s, not 49 s, a value that EXPLAIN
will report nowhere. But they can also start and stop executing under
that node repeatedly and they need not all start at the same time,
making the concept of elapsed time a bit unclear -- earliest start to
latest finish would be one way, but that will be misleading (perhaps)
if they spend 5 seconds apiece but start staggered at 4-second
intervals.
It's really hard to understand what's actually going on with a
parallel query unless you look at each worker individually, and even
then sometimes it's not as clear as it could be.
I think the stats at the Gather node itself give a somewhat accurate picture.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Nov 29, 2017 at 2:04 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Nov 28, 2017 at 9:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Nov 28, 2017 at 2:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
That is wrong and I think you have hit a bug. It should be 2974 * 5 =
14870 as you have seen in other cases. The problem is that during
rescan, we generally reinitialize the required state, but we forgot to
reinitialize the instrumentation related memory which is used in the
accumulation of stats, so changing that would fix some part of this
problem which is that at Parallel node, you won't see wrong values.
However, we also need to ensure that the per-worker details also get
accumulated across rescans. Attached patch should fix the problem you
are seeing. I think this needs some more analysis and testing to see
if everything works in the desired way.
Is it possible for you to test the attached patch and see if you are
still seeing any unexpected values?
FWIW, this looks sensible to me. Not sure if there's any good way to
write a regression test for it.
I think so, but not 100% sure. I will give it a try and report back.
The attached patch contains a regression test as well. Note that I have
carefully disabled all variable stats by using (analyze, timing off,
summary off, costs off) and then put the parallel sequential scan on
the right side of the join, so that nloops and rows are the only variable
stats, and those should remain constant.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
fix_accum_instr_parallel_workers_v2.patch
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 53c5254..6a6bf20 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -919,14 +919,25 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
* regular instrumentation information, which is the per-query context.
* Switch into per-query memory context.
*/
- oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
- ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
- planstate->worker_instrument =
- palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
- MemoryContextSwitchTo(oldcontext);
+ if (!planstate->worker_instrument)
+ {
+ oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
+ ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
+ planstate->worker_instrument =
+ palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
+ MemoryContextSwitchTo(oldcontext);
+
+ for (n = 0; n < instrumentation->num_workers; ++n)
+ InstrInit(&planstate->worker_instrument->instrument[n],
+ planstate->state->es_instrument);
+ }
planstate->worker_instrument->num_workers = instrumentation->num_workers;
- memcpy(&planstate->worker_instrument->instrument, instrument, ibytes);
+
+ /* Accumulate the per-worker detail. */
+ for (n = 0; n < instrumentation->num_workers; ++n)
+ InstrAggNode(&planstate->worker_instrument->instrument[n],
+ &instrument[n]);
/*
* Perform any node-type-specific work that needs to be done. Currently,
@@ -989,8 +1000,20 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
/* Finally, accumulate instrumentation, if any. */
if (pei->instrumentation)
+ {
+ Instrumentation *instrument;
+ SharedExecutorInstrumentation *sh_instr;
+
+ sh_instr = pei->instrumentation;
+
ExecParallelRetrieveInstrumentation(pei->planstate,
- pei->instrumentation);
+ sh_instr);
+
+ /* Clear the instrumentation space for next time. */
+ instrument = GetInstrumentationArray(sh_instr);
+ for (i = 0; i < sh_instr->num_workers * sh_instr->num_plan_nodes; ++i)
+ InstrInit(&instrument[i], pei->planstate->state->es_instrument);
+ }
pei->finished = true;
}
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index d1d5b22..681f84a 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -378,7 +378,27 @@ select count(*) from bmscantest where a>1;
99999
(1 row)
+-- test accumulation of stats for parallel node
reset enable_seqscan;
+alter table tenk2 set (parallel_workers = 0);
+explain (analyze, timing off, summary off, costs off)
+ select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+ -> Nested Loop (actual rows=98000 loops=1)
+ -> Seq Scan on tenk2 (actual rows=10 loops=1)
+ Filter: (thousand = 0)
+ Rows Removed by Filter: 9990
+ -> Gather (actual rows=9800 loops=10)
+ Workers Planned: 4
+ Workers Launched: 4
+ -> Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
+ Filter: (hundred > 1)
+ Rows Removed by Filter: 40
+(11 rows)
+
+alter table tenk2 reset (parallel_workers);
reset enable_indexscan;
reset enable_hashjoin;
reset enable_mergejoin;
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index bb4e34a..0c7353f 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -149,7 +149,13 @@ insert into bmscantest select r, 'fooooooooooooooooooooooooooooooooooooooooooooo
create index i_bmtest ON bmscantest(a);
select count(*) from bmscantest where a>1;
+-- test accumulation of stats for parallel node
reset enable_seqscan;
+alter table tenk2 set (parallel_workers = 0);
+explain (analyze, timing off, summary off, costs off)
+ select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
+alter table tenk2 reset (parallel_workers);
+
reset enable_indexscan;
reset enable_hashjoin;
reset enable_mergejoin;
On Sat, Dec 2, 2017 at 8:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Attached patch contains regression test as well. Note that I have
carefully disabled all variable stats by using (analyze, timing off,
summary off, costs off) and then selected parallel sequential scan on
the right of join so that we have nloops and rows as variable stats
and those should remain constant.
The regression test contains a whitespace error about which git diff
--check complains.
Also, looking at this again, shouldn't the reinitialization of the
instrumentation arrays happen in ExecParallelReinitialize rather than
ExecParallelFinish, so that we don't spend time doing it unless the
Gather is actually re-executed?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Dec 4, 2017 at 11:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Dec 2, 2017 at 8:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Attached patch contains regression test as well. Note that I have
carefully disabled all variable stats by using (analyze, timing off,
summary off, costs off) and then selected parallel sequential scan on
the right of join so that we have nloops and rows as variable stats
and those should remain constant.
The regression test contains a whitespace error about which git diff
--check complains.
Oops, a silly mistake on my side.
Also, looking at this again, shouldn't the reinitialization of the
instrumentation arrays happen in ExecParallelReinitialize rather than
ExecParallelFinish, so that we don't spend time doing it unless the
Gather is actually re-executed?
Yeah, that sounds better, so I modified the patch accordingly.
I have another observation in a somewhat related area. From the
code, it looks like we might have a problem with displaying sort
info for workers on rescans. I think the problem with the sort info
is that it initializes shared_info with local memory in
ExecSortRetrieveInstrumentation, after which it won't be able to access
the values in shared memory changed by workers in rescans. We might
be able to fix it by having some local_info, the same as shared_info, in the
sort node. But the main problem is how we accumulate stats for
workers across rescans. The type of sort method can change across
rescans. We might be able to accumulate the memory size, though I am
not sure if that is right. Although this appears to be somewhat
related to the problem being discussed in this thread, I think it can
be dealt with separately if we want to fix it.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
fix_accum_instr_parallel_workers_v3.patch
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 53c5254..8c53774 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -799,6 +799,19 @@ ExecParallelReinitialize(PlanState *planstate,
/* Old workers must already be shut down */
Assert(pei->finished);
+ /* Clear the instrumentation space from the last round. */
+ if (pei->instrumentation)
+ {
+ Instrumentation *instrument;
+ SharedExecutorInstrumentation *sh_instr;
+ int i;
+
+ sh_instr = pei->instrumentation;
+ instrument = GetInstrumentationArray(sh_instr);
+ for (i = 0; i < sh_instr->num_workers * sh_instr->num_plan_nodes; ++i)
+ InstrInit(&instrument[i], pei->planstate->state->es_instrument);
+ }
+
/* Force parameters we're going to pass to workers to be evaluated. */
ExecEvalParamExecParams(sendParams, estate);
@@ -919,14 +932,25 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
* regular instrumentation information, which is the per-query context.
* Switch into per-query memory context.
*/
- oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
- ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
- planstate->worker_instrument =
- palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
- MemoryContextSwitchTo(oldcontext);
+ if (!planstate->worker_instrument)
+ {
+ oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
+ ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
+ planstate->worker_instrument =
+ palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
+ MemoryContextSwitchTo(oldcontext);
+
+ for (n = 0; n < instrumentation->num_workers; ++n)
+ InstrInit(&planstate->worker_instrument->instrument[n],
+ planstate->state->es_instrument);
+ }
planstate->worker_instrument->num_workers = instrumentation->num_workers;
- memcpy(&planstate->worker_instrument->instrument, instrument, ibytes);
+
+ /* Accumulate the per-worker detail. */
+ for (n = 0; n < instrumentation->num_workers; ++n)
+ InstrAggNode(&planstate->worker_instrument->instrument[n],
+ &instrument[n]);
/*
* Perform any node-type-specific work that needs to be done. Currently,
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index d1d5b22..3bb4381 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -378,7 +378,27 @@ select count(*) from bmscantest where a>1;
99999
(1 row)
+-- test accumulation of stats for parallel node
reset enable_seqscan;
+alter table tenk2 set (parallel_workers = 0);
+explain (analyze, timing off, summary off, costs off)
+ select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+ -> Nested Loop (actual rows=98000 loops=1)
+ -> Seq Scan on tenk2 (actual rows=10 loops=1)
+ Filter: (thousand = 0)
+ Rows Removed by Filter: 9990
+ -> Gather (actual rows=9800 loops=10)
+ Workers Planned: 4
+ Workers Launched: 4
+ -> Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
+ Filter: (hundred > 1)
+ Rows Removed by Filter: 40
+(11 rows)
+
+alter table tenk2 reset (parallel_workers);
reset enable_indexscan;
reset enable_hashjoin;
reset enable_mergejoin;
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index bb4e34a..0f006f9 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -149,7 +149,13 @@ insert into bmscantest select r, 'fooooooooooooooooooooooooooooooooooooooooooooo
create index i_bmtest ON bmscantest(a);
select count(*) from bmscantest where a>1;
+-- test accumulation of stats for parallel node
reset enable_seqscan;
+alter table tenk2 set (parallel_workers = 0);
+explain (analyze, timing off, summary off, costs off)
+ select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
+alter table tenk2 reset (parallel_workers);
+
reset enable_indexscan;
reset enable_hashjoin;
reset enable_mergejoin;
On Tue, Dec 5, 2017 at 6:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I have one another observation in the somewhat related area. From the
code, it looks like we might have some problem with displaying sort
info for workers for rescans. I think the problem with the sortinfo
is that it initializes shared info with local memory in
ExecSortRetrieveInstrumentation after which it won't be able to access
the values in shared memory changed by workers in rescans. We might
be able to fix it by having some local_info same as shared_info in
sort node. But the main problem is how do we accumulate stats for
workers across rescans. The type of sort method can change across
rescans. We might be able to accumulate the size of Memory though,
but not sure if that is right. I think though this appears to be
somewhat related to the problem being discussed in this thread, it can
be dealt separately if we want to fix it.
Yeah, that's broken. ExecSortRetrieveInstrumentation() is run for
each loop, and after the first loop we've lost track of the pointer
into shared memory because we replaced it with a palloc'd copy. We
could do what you said, or we could reinstate the pointer into the DSM
in ExecSortReInitializeDSM() by looking it up in the TOC.
The reason I've popped up out of nowhere on this thread to say this is
that I just realised that my nearby patch that adds support for Hash
instrumentation has the same bug, because I copied the way
ExecSortRetrieveInstrumentation() works to make
ExecHashRetrieveInstrumentation(). I initially assumed this would run
just once after the final loop in a rescan situation, but on testing I
see that it runs repeatedly, and of course loses instrumentation
during rescans. I'll go and fix that. We should choose one approach
for both cases. Do you prefer a separate variable for the local copy,
or reinstating the pointer into the DSM during
ExecXXXReInitializeDSM()?
As for how to aggregate the information, isn't it reasonable to show
data from the last loop on the basis that it's representative?
Summing wouldn't make too much sense, because you didn't use that much
memory all at once.
--
Thomas Munro
http://www.enterprisedb.com
On Tue, Dec 5, 2017 at 8:49 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Tue, Dec 5, 2017 at 6:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I have one another observation in the somewhat related area. From the
code, it looks like we might have some problem with displaying sort
info for workers for rescans. I think the problem with the sortinfo
is that it initializes shared info with local memory in
ExecSortRetrieveInstrumentation after which it won't be able to access
the values in shared memory changed by workers in rescans. We might
be able to fix it by having some local_info same as shared_info in
sort node. But the main problem is how do we accumulate stats for
workers across rescans. The type of sort method can change across
rescans. We might be able to accumulate the size of Memory though,
but not sure if that is right. I think though this appears to be
somewhat related to the problem being discussed in this thread, it can
be dealt separately if we want to fix it.
Yeah, that's broken. ExecSortRetrieveInstrumentation() is run for
each loop, and after the first loop we've lost track of the pointer
into shared memory because we replaced it with palloc'd copy. We
could do what you said, or we could reinstate the pointer into the DSM
in ExecSortReInitializeDSM() by looking it up in the TOC.
Or would it be an option to change the time
ExecXXXRetrieveInstrumentation() is called so it is run only once?
--
Thomas Munro
http://www.enterprisedb.com
On Tue, Dec 5, 2017 at 1:29 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Tue, Dec 5, 2017 at 8:49 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Tue, Dec 5, 2017 at 6:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I have one another observation in the somewhat related area. From the
code, it looks like we might have some problem with displaying sort
info for workers for rescans. I think the problem with the sortinfo
is that it initializes shared info with local memory in
ExecSortRetrieveInstrumentation after which it won't be able to access
the values in shared memory changed by workers in rescans. We might
be able to fix it by having some local_info same as shared_info in
sort node. But the main problem is how do we accumulate stats for
workers across rescans. The type of sort method can change across
rescans. We might be able to accumulate the size of Memory though,
but not sure if that is right. I think though this appears to be
somewhat related to the problem being discussed in this thread, it can
be dealt separately if we want to fix it.
Yeah, that's broken. ExecSortRetrieveInstrumentation() is run for
each loop, and after the first loop we've lost track of the pointer
into shared memory because we replaced it with palloc'd copy. We
could do what you said, or we could reinstate the pointer into the DSM
in ExecSortReInitializeDSM() by looking it up in the TOC.
Or would it be an option to change the time
ExecXXXRetrieveInstrumentation() is called so it is run only once?
To me, that doesn't sound like a bad option. I think if we do so, then
we don't even need to reinitialize the shared sort stats. I think
something like the attached should work if we want to go this route. We
can add a regression test if we think this is a good idea.
Having said that, one problem I see with doing things this way is that, in
general, we will display the accumulated stats of each worker, but for
sort or some other special nodes (like hash), we will show the
information of only the last loop. I am not sure if that is a matter of
concern, but if we want to do it this way, then probably we should
explain this in the documentation as well.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
fix_accum_instr_parallel_workers_v4.patch
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 53c5254..a60dd6f 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -872,11 +872,6 @@ ExecParallelReInitializeDSM(PlanState *planstate,
ExecBitmapHeapReInitializeDSM((BitmapHeapScanState *) planstate,
pcxt);
break;
- case T_SortState:
- /* even when not parallel-aware */
- ExecSortReInitializeDSM((SortState *) planstate, pcxt);
- break;
-
default:
break;
}
@@ -987,11 +982,6 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
for (i = 0; i < nworkers; i++)
InstrAccumParallelQuery(&pei->buffer_usage[i]);
- /* Finally, accumulate instrumentation, if any. */
- if (pei->instrumentation)
- ExecParallelRetrieveInstrumentation(pei->planstate,
- pei->instrumentation);
-
pei->finished = true;
}
@@ -1004,6 +994,10 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
void
ExecParallelCleanup(ParallelExecutorInfo *pei)
{
+ /* Accumulate instrumentation, if any. */
+ if (pei->instrumentation)
+ ExecParallelRetrieveInstrumentation(pei->planstate,
+ pei->instrumentation);
/* Free any serialized parameters. */
if (DsaPointerIsValid(pei->param_exec))
{
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73aa371..d593378 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -397,23 +397,6 @@ ExecSortInitializeDSM(SortState *node, ParallelContext *pcxt)
}
/* ----------------------------------------------------------------
- * ExecSortReInitializeDSM
- *
- * Reset shared state before beginning a fresh scan.
- * ----------------------------------------------------------------
- */
-void
-ExecSortReInitializeDSM(SortState *node, ParallelContext *pcxt)
-{
- /* If there's any instrumentation space, clear it for next time */
- if (node->shared_info != NULL)
- {
- memset(node->shared_info->sinstrument, 0,
- node->shared_info->num_workers * sizeof(TuplesortInstrumentation));
- }
-}
-
-/* ----------------------------------------------------------------
* ExecSortInitializeWorker
*
* Attach worker to DSM space for sort statistics.
diff --git a/src/include/executor/nodeSort.h b/src/include/executor/nodeSort.h
index cc61a9d..627a04c 100644
--- a/src/include/executor/nodeSort.h
+++ b/src/include/executor/nodeSort.h
@@ -26,7 +26,6 @@ extern void ExecReScanSort(SortState *node);
/* parallel instrumentation support */
extern void ExecSortEstimate(SortState *node, ParallelContext *pcxt);
extern void ExecSortInitializeDSM(SortState *node, ParallelContext *pcxt);
-extern void ExecSortReInitializeDSM(SortState *node, ParallelContext *pcxt);
extern void ExecSortInitializeWorker(SortState *node, ParallelWorkerContext *pwcxt);
extern void ExecSortRetrieveInstrumentation(SortState *node);
On Tue, Dec 5, 2017 at 2:49 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
As for how to aggregate the information, isn't it reasonable to show
data from the last loop on the basis that it's representative?
Summing wouldn't make too much sense, because you didn't use that much
memory all at once.
Sorts can be rescanned even without parallel query, so I guess we
should try to make the parallel case kinda like the non-parallel case.
If I'm not wrong, that will just use the stats from the most recent
execution (i.e. the last loop) -- see show_sort_info() in explain.c.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Dec 5, 2017 at 12:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Yeah, that sounds better, so modified the patch accordingly.
I committed this to master and REL_10_STABLE, but it conflicts all
over the place on 9.6.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Dec 6, 2017 at 12:02 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Dec 5, 2017 at 1:29 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Or would it be an option to change the time
ExecXXXRetrieveInstrumentation() is called so it is run only once?
To me, that doesn't sound like a bad option. I think if we do so, then
we don't even need to reinitialize the shared sort stats. I think
something, like attached, should work if we want to go this route. We
can add regression test if this is what we think is a good idea.
Having said that, one problem I see doing thing this way is that in
general, we will display the accumulated stats of each worker, but for
sort or some other special nodes (like hash), we will show the
information of only last loop. I am not sure if that is a matter of
concern, but if we want to do this way, then probably we should
explain this in documentation as well.
The hash version of this code is now committed as 5bcf389e. Here is a
patch for discussion that adds some extra tests to join.sql to
exercise rescans of a hash join under a Gather node. It fails on
head, because it loses track of the instrumentation pointer after the
first loop as you described (since the Hash coding is the same as the
Sort coding), so it finishes up with no instrumentation data. If you
move ExecParallelRetrieveInstrumentation() to ExecParallelCleanup() as
you showed in your patch, then it passes. The way I'm asserting that
instrumentation data is making its way back to the leader is by
turning off leader participation and then checking if it knows how
many batches there were.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
test-hash-join-rescan-instr-v1.patch
From efe7d48a2944660a90975936d12596e6d7dbbc58 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Wed, 6 Dec 2017 09:59:48 +1300
Subject: [PATCH] Add a regression test for parallel hash join rescans.
*FOR DEMONSTRATION ONLY, NOT FOR COMMIT YET*
---
src/test/regress/expected/join.out | 106 +++++++++++++++++++++++++++++++++++++
src/test/regress/sql/join.sql | 60 +++++++++++++++++++++
2 files changed, 166 insertions(+)
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 001d96dc2d8..9d3abf0ed05 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -6187,6 +6187,112 @@ $$);
1 | 1
(1 row)
+rollback to settings;
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+create table foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table foo set (parallel_workers = 0);
+create table bar as select generate_series(1, 5000) as id, 'xxxxx'::text as t;
+alter table bar set (parallel_workers = 2);
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+explain (costs off)
+ select count(*) from foo
+ left join (select b1.id, b1.t from bar b1 join bar b2 using (id)) ss
+ on foo.id < ss.id + 1 and foo.id > ss.id - 1;
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((foo.id < (b1.id + 1)) AND (foo.id > (b1.id - 1)))
+ -> Seq Scan on foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on bar b1
+ -> Hash
+ -> Seq Scan on bar b2
+(11 rows)
+
+select count(*) from foo
+ left join (select b1.id, b1.t from bar b1 join bar b2 using (id)) ss
+ on foo.id < ss.id + 1 and foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from foo
+ left join (select b1.id, b1.t from bar b1 join bar b2 using (id)) ss
+ on foo.id < ss.id + 1 and foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+explain (costs off)
+ select count(*) from foo
+ left join (select b1.id, b1.t from bar b1 join bar b2 using (id)) ss
+ on foo.id < ss.id + 1 and foo.id > ss.id - 1;
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((foo.id < (b1.id + 1)) AND (foo.id > (b1.id - 1)))
+ -> Seq Scan on foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on bar b1
+ -> Hash
+ -> Seq Scan on bar b2
+(11 rows)
+
+select count(*) from foo
+ left join (select b1.id, b1.t from bar b1 join bar b2 using (id)) ss
+ on foo.id < ss.id + 1 and foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from foo
+ left join (select b1.id, b1.t from bar b1 join bar b2 using (id)) ss
+ on foo.id < ss.id + 1 and foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
rollback to settings;
-- A full outer join where every record is matched.
-- non-parallel
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 882601b3388..0e933e00d54 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -2170,6 +2170,66 @@ $$
$$);
rollback to settings;
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+
+create table foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table foo set (parallel_workers = 0);
+create table bar as select generate_series(1, 5000) as id, 'xxxxx'::text as t;
+alter table bar set (parallel_workers = 2);
+
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+explain (costs off)
+ select count(*) from foo
+ left join (select b1.id, b1.t from bar b1 join bar b2 using (id)) ss
+ on foo.id < ss.id + 1 and foo.id > ss.id - 1;
+select count(*) from foo
+ left join (select b1.id, b1.t from bar b1 join bar b2 using (id)) ss
+ on foo.id < ss.id + 1 and foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from foo
+ left join (select b1.id, b1.t from bar b1 join bar b2 using (id)) ss
+ on foo.id < ss.id + 1 and foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+explain (costs off)
+ select count(*) from foo
+ left join (select b1.id, b1.t from bar b1 join bar b2 using (id)) ss
+ on foo.id < ss.id + 1 and foo.id > ss.id - 1;
+select count(*) from foo
+ left join (select b1.id, b1.t from bar b1 join bar b2 using (id)) ss
+ on foo.id < ss.id + 1 and foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from foo
+ left join (select b1.id, b1.t from bar b1 join bar b2 using (id)) ss
+ on foo.id < ss.id + 1 and foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
-- A full outer join where every record is matched.
-- non-parallel
--
2.15.0
On Wed, Dec 6, 2017 at 12:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Dec 5, 2017 at 2:49 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
As for how to aggregate the information, isn't it reasonable to show
data from the last loop on the basis that it's representative?
Summing wouldn't make too much sense, because you didn't use that much
memory all at once.
Sorts can be rescanned even without parallel query, so I guess we
should try to make the parallel case kinda like the non-parallel case.
If I'm not wrong, that will just use the stats from the most recent
execution (i.e. the last loop) -- see show_sort_info() in explain.c.
Right, and seeing that, I have prepared the patch (posted above [1])
which fixes it such that it will resemble the non-parallel case.
Ideally, it would have obviated the need for my previous patch which
got committed as 778e78ae. However, now that that is committed, I can
think of the following options:
1. I shall rebase it atop what is committed, and actually I have done
that in the attached patch. I have also prepared a regression test
case patch just to show the output with and without the patch.
2. For the sort node, we can fix it by having some local_info, the same as
shared_info, in the sort node and copy shared_info into it, or we could
reinstate the pointer to the DSM in ExecSortReInitializeDSM() by
looking it up in the TOC as suggested by Thomas. If we go this way,
then we need a similar fix for the hash node as well.
[1]: /messages/by-id/CAA4eK1JBj4YCEQKeTua5=BMXy7zW7zNOvoXomzBP=kb_aqMF7w@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
fix_accum_instr_parallel_workers_v5.patch
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 558cb08..da43896 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -819,19 +819,6 @@ ExecParallelReinitialize(PlanState *planstate,
/* Old workers must already be shut down */
Assert(pei->finished);
- /* Clear the instrumentation space from the last round. */
- if (pei->instrumentation)
- {
- Instrumentation *instrument;
- SharedExecutorInstrumentation *sh_instr;
- int i;
-
- sh_instr = pei->instrumentation;
- instrument = GetInstrumentationArray(sh_instr);
- for (i = 0; i < sh_instr->num_workers * sh_instr->num_plan_nodes; ++i)
- InstrInit(&instrument[i], pei->planstate->state->es_instrument);
- }
-
/* Force parameters we're going to pass to workers to be evaluated. */
ExecEvalParamExecParams(sendParams, estate);
@@ -909,14 +896,6 @@ ExecParallelReInitializeDSM(PlanState *planstate,
ExecBitmapHeapReInitializeDSM((BitmapHeapScanState *) planstate,
pcxt);
break;
- case T_HashState:
- /* even when not parallel-aware, for EXPLAIN ANALYZE */
- ExecHashReInitializeDSM((HashState *) planstate, pcxt);
- break;
- case T_SortState:
- /* even when not parallel-aware, for EXPLAIN ANALYZE */
- ExecSortReInitializeDSM((SortState *) planstate, pcxt);
- break;
default:
break;
@@ -1046,11 +1025,6 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
for (i = 0; i < nworkers; i++)
InstrAccumParallelQuery(&pei->buffer_usage[i]);
- /* Finally, accumulate instrumentation, if any. */
- if (pei->instrumentation)
- ExecParallelRetrieveInstrumentation(pei->planstate,
- pei->instrumentation);
-
pei->finished = true;
}
@@ -1063,6 +1037,11 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
void
ExecParallelCleanup(ParallelExecutorInfo *pei)
{
+ /* Accumulate instrumentation, if any. */
+ if (pei->instrumentation)
+ ExecParallelRetrieveInstrumentation(pei->planstate,
+ pei->instrumentation);
+
/* Free any serialized parameters. */
if (DsaPointerIsValid(pei->param_exec))
{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6fe5d69..afd7384 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1670,19 +1670,6 @@ ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt)
}
/*
- * Reset shared state before beginning a fresh scan.
- */
-void
-ExecHashReInitializeDSM(HashState *node, ParallelContext *pcxt)
-{
- if (node->shared_info != NULL)
- {
- memset(node->shared_info->hinstrument, 0,
- node->shared_info->num_workers * sizeof(HashInstrumentation));
- }
-}
-
-/*
* Locate the DSM space for hash table instrumentation data that we'll write
* to at shutdown time.
*/
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73aa371..d593378 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -397,23 +397,6 @@ ExecSortInitializeDSM(SortState *node, ParallelContext *pcxt)
}
/* ----------------------------------------------------------------
- * ExecSortReInitializeDSM
- *
- * Reset shared state before beginning a fresh scan.
- * ----------------------------------------------------------------
- */
-void
-ExecSortReInitializeDSM(SortState *node, ParallelContext *pcxt)
-{
- /* If there's any instrumentation space, clear it for next time */
- if (node->shared_info != NULL)
- {
- memset(node->shared_info->sinstrument, 0,
- node->shared_info->num_workers * sizeof(TuplesortInstrumentation));
- }
-}
-
-/* ----------------------------------------------------------------
* ExecSortInitializeWorker
*
* Attach worker to DSM space for sort statistics.
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 75d4c70..0974f1e 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -52,7 +52,6 @@ extern int ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);
extern void ExecHashEstimate(HashState *node, ParallelContext *pcxt);
extern void ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt);
extern void ExecHashInitializeWorker(HashState *node, ParallelWorkerContext *pwcxt);
-extern void ExecHashReInitializeDSM(HashState *node, ParallelContext *pcxt);
extern void ExecHashRetrieveInstrumentation(HashState *node);
extern void ExecShutdownHash(HashState *node);
extern void ExecHashGetInstrumentation(HashInstrumentation *instrument,
diff --git a/src/include/executor/nodeSort.h b/src/include/executor/nodeSort.h
index cc61a9d..627a04c 100644
--- a/src/include/executor/nodeSort.h
+++ b/src/include/executor/nodeSort.h
@@ -26,7 +26,6 @@ extern void ExecReScanSort(SortState *node);
/* parallel instrumentation support */
extern void ExecSortEstimate(SortState *node, ParallelContext *pcxt);
extern void ExecSortInitializeDSM(SortState *node, ParallelContext *pcxt);
-extern void ExecSortReInitializeDSM(SortState *node, ParallelContext *pcxt);
extern void ExecSortInitializeWorker(SortState *node, ParallelWorkerContext *pwcxt);
extern void ExecSortRetrieveInstrumentation(SortState *node);
test_sort_stats_v1.1.patch
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index ff00d47..de0d0b4 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -486,6 +486,28 @@ explain (analyze, timing off, summary off, costs off)
Rows Removed by Filter: 40
(11 rows)
+explain (analyze, timing off, summary off, costs off)
+ select * from
+ (select ten from tenk1 where ten < 100 order by ten) ss
+ right join (values (1),(2),(3)) v(x) on true;
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Nested Loop Left Join (actual rows=30000 loops=1)
+ -> Values Scan on "*VALUES*" (actual rows=3 loops=1)
+ -> Gather Merge (actual rows=10000 loops=3)
+ Workers Planned: 4
+ Workers Launched: 4
+ -> Sort (actual rows=2000 loops=15)
+ Sort Key: tenk1.ten
+ Sort Method: external sort Disk: 200kB
+ Worker 0: Sort Method: quicksort Memory: 25kB
+ Worker 1: Sort Method: quicksort Memory: 25kB
+ Worker 2: Sort Method: quicksort Memory: 25kB
+ Worker 3: Sort Method: quicksort Memory: 25kB
+ -> Parallel Seq Scan on tenk1 (actual rows=2000 loops=15)
+ Filter: (ten < 100)
+(14 rows)
+
alter table tenk2 reset (parallel_workers);
reset enable_indexscan;
reset enable_hashjoin;
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index 1035d04..f066dd5 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -185,6 +185,10 @@ alter table tenk2 set (parallel_workers = 0);
explain (analyze, timing off, summary off, costs off)
select count(*) from tenk1, tenk2 where tenk1.hundred > 1
and tenk2.thousand=0;
+explain (analyze, timing off, summary off, costs off)
+ select * from
+ (select ten from tenk1 where ten < 100 order by ten) ss
+ right join (values (1),(2),(3)) v(x) on true;
alter table tenk2 reset (parallel_workers);
reset enable_indexscan;
On Wed, Dec 6, 2017 at 1:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Dec 5, 2017 at 12:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Yeah, that sounds better, so modified the patch accordingly.
I committed this to master and REL_10_STABLE, but it conflicts all
over the place on 9.6.
I will try to prepare the patch for 9.6, but I think it might be
better if we first decide what to do about the open issue for sort and
hash nodes, as there can be some overlap based on what approach we
choose to fix it.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 6, 2017 at 1:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Right and seeing that I have prepared the patch (posted above [1])
which fixes it such that it will resemble the non-parallel case.
Ideally, it would have obviated the need for my previous patch which
got committed as 778e78ae. However, now that is committed, I could
think of below options:
1. I shall rebase it atop what is committed and actually, I have done
that in the attached patch. I have also prepared a regression test
case patch just to show the output with and without the patch.
2. For sort node, we can fix it by having some local_info same as
shared_info in sort node and copy the shared_info in that or we could
reinstate the pointer to the DSM in ExecSortReInitializeDSM() by
looking it up in the TOC as suggested by Thomas. If we go this way,
then we need a similar fix for hash node as well.
Well, the patch you've actually attached makes the bug go away by
removing a net of 53 lines of code. The other approach would probably
add code. So I am tempted to go with the approach you have here. I
would probably change the T_HashState and T_SortState cases in
ExecParallelReInitializeDSM so that they still exist, but just do
something like this:
case T_HashState:
case T_SortState:
/* these nodes have DSM state, but no reinitialization is required */
break;
That way, it will be more clear to future readers of this code that
the lack of a reinitialize function is not an oversight, and the
compiler should optimize these cases away, merging them with the
default case.
I was a little worried when I first opened this patch that it might be
imposing a one-size-fits-all solution; just because sort and hash want
to report details from the last execution, there could be some other
executor node that wants to do otherwise. But on reflection, that's
just fine: an individual executor node is free to accumulate stats
across rescans if it wants to do so. It's merely that sort and hash
don't want to do that. In fact, it's really the current system that
is imposing a straightjacket: no matter what the individual node types
choose to do, rescans are a pile of fail.
Long story short, I like the patch.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Dec 8, 2017 at 7:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Dec 6, 2017 at 1:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Right and seeing that I have prepared the patch (posted above [1])
which fixes it such that it will resemble the non-parallel case.
Long story short, I like the patch.
LGTM. There might be an argument for clearing the instrumentation
every time on the basis that you might finish up keeping data from a
non-final loop when a worker opted not to do anything in the final
loop, but I'm not going to make that argument because I don't think it
matters. The patch makes the tests in
test-hash-join-rescan-instr-v1.patch pass (from my previous message).
Please also consider that test patch for commit.
--
Thomas Munro
http://www.enterprisedb.com
On Fri, Dec 8, 2017 at 12:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Dec 6, 2017 at 1:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Right and seeing that I have prepared the patch (posted above [1])
which fixes it such that it will resemble the non-parallel case.
Ideally, it would have obviated the need for my previous patch which
got committed as 778e78ae. However, now that is committed, I could
think of below options:
1. I shall rebase it atop what is committed and actually, I have done
that in the attached patch. I have also prepared a regression test
case patch just to show the output with and without the patch.
2. For sort node, we can fix it by having some local_info same as
shared_info in sort node and copy the shared_info in that or we could
reinstate the pointer to the DSM in ExecSortReInitializeDSM() by
looking it up in the TOC as suggested by Thomas. If we go this way,
then we need a similar fix for hash node as well.
Well, the patch you've actually attached makes the bug go away by
removing a net of 53 lines of code. The other approach would probably
add code. So I am tempted to go with the approach you have here. I
would probably change the T_HashState and T_SortState cases in
ExecParallelReInitializeDSM so that they still exist, but just do
something like this:
case T_HashState:
case T_SortState:
/* these nodes have DSM state, but no reinitialization is required */
break;
That way, it will be more clear to future readers of this code that
the lack of a reinitialize function is not an oversight, and the
compiler should optimize these cases away, merging them with the
default case.
Okay, I have adjusted the patch accordingly. I have also added a
regression test which should produce the same result across different
runs, see if that looks okay to you, then it is better to add such a
test as well.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
fix_accum_instr_parallel_workers_v6.patch (application/octet-stream)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 558cb08..64ca8d8 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -819,19 +819,6 @@ ExecParallelReinitialize(PlanState *planstate,
/* Old workers must already be shut down */
Assert(pei->finished);
- /* Clear the instrumentation space from the last round. */
- if (pei->instrumentation)
- {
- Instrumentation *instrument;
- SharedExecutorInstrumentation *sh_instr;
- int i;
-
- sh_instr = pei->instrumentation;
- instrument = GetInstrumentationArray(sh_instr);
- for (i = 0; i < sh_instr->num_workers * sh_instr->num_plan_nodes; ++i)
- InstrInit(&instrument[i], pei->planstate->state->es_instrument);
- }
-
/* Force parameters we're going to pass to workers to be evaluated. */
ExecEvalParamExecParams(sendParams, estate);
@@ -910,12 +897,8 @@ ExecParallelReInitializeDSM(PlanState *planstate,
pcxt);
break;
case T_HashState:
- /* even when not parallel-aware, for EXPLAIN ANALYZE */
- ExecHashReInitializeDSM((HashState *) planstate, pcxt);
- break;
case T_SortState:
- /* even when not parallel-aware, for EXPLAIN ANALYZE */
- ExecSortReInitializeDSM((SortState *) planstate, pcxt);
+ /* these nodes have DSM state, but no reinitialization is required */
break;
default:
@@ -1046,11 +1029,6 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
for (i = 0; i < nworkers; i++)
InstrAccumParallelQuery(&pei->buffer_usage[i]);
- /* Finally, accumulate instrumentation, if any. */
- if (pei->instrumentation)
- ExecParallelRetrieveInstrumentation(pei->planstate,
- pei->instrumentation);
-
pei->finished = true;
}
@@ -1063,6 +1041,11 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
void
ExecParallelCleanup(ParallelExecutorInfo *pei)
{
+ /* Accumulate instrumentation, if any. */
+ if (pei->instrumentation)
+ ExecParallelRetrieveInstrumentation(pei->planstate,
+ pei->instrumentation);
+
/* Free any serialized parameters. */
if (DsaPointerIsValid(pei->param_exec))
{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6fe5d69..afd7384 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1670,19 +1670,6 @@ ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt)
}
/*
- * Reset shared state before beginning a fresh scan.
- */
-void
-ExecHashReInitializeDSM(HashState *node, ParallelContext *pcxt)
-{
- if (node->shared_info != NULL)
- {
- memset(node->shared_info->hinstrument, 0,
- node->shared_info->num_workers * sizeof(HashInstrumentation));
- }
-}
-
-/*
* Locate the DSM space for hash table instrumentation data that we'll write
* to at shutdown time.
*/
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73aa371..d593378 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -397,23 +397,6 @@ ExecSortInitializeDSM(SortState *node, ParallelContext *pcxt)
}
/* ----------------------------------------------------------------
- * ExecSortReInitializeDSM
- *
- * Reset shared state before beginning a fresh scan.
- * ----------------------------------------------------------------
- */
-void
-ExecSortReInitializeDSM(SortState *node, ParallelContext *pcxt)
-{
- /* If there's any instrumentation space, clear it for next time */
- if (node->shared_info != NULL)
- {
- memset(node->shared_info->sinstrument, 0,
- node->shared_info->num_workers * sizeof(TuplesortInstrumentation));
- }
-}
-
-/* ----------------------------------------------------------------
* ExecSortInitializeWorker
*
* Attach worker to DSM space for sort statistics.
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 75d4c70..0974f1e 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -52,7 +52,6 @@ extern int ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);
extern void ExecHashEstimate(HashState *node, ParallelContext *pcxt);
extern void ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt);
extern void ExecHashInitializeWorker(HashState *node, ParallelWorkerContext *pwcxt);
-extern void ExecHashReInitializeDSM(HashState *node, ParallelContext *pcxt);
extern void ExecHashRetrieveInstrumentation(HashState *node);
extern void ExecShutdownHash(HashState *node);
extern void ExecHashGetInstrumentation(HashInstrumentation *instrument,
diff --git a/src/include/executor/nodeSort.h b/src/include/executor/nodeSort.h
index cc61a9d..627a04c 100644
--- a/src/include/executor/nodeSort.h
+++ b/src/include/executor/nodeSort.h
@@ -26,7 +26,6 @@ extern void ExecReScanSort(SortState *node);
/* parallel instrumentation support */
extern void ExecSortEstimate(SortState *node, ParallelContext *pcxt);
extern void ExecSortInitializeDSM(SortState *node, ParallelContext *pcxt);
-extern void ExecSortReInitializeDSM(SortState *node, ParallelContext *pcxt);
extern void ExecSortInitializeWorker(SortState *node, ParallelWorkerContext *pwcxt);
extern void ExecSortRetrieveInstrumentation(SortState *node);
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index ff00d47..5323d67 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -487,12 +487,47 @@ explain (analyze, timing off, summary off, costs off)
(11 rows)
alter table tenk2 reset (parallel_workers);
+reset work_mem;
+create function explain_parallel_sort_stats() returns setof text
+language plpgsql as
+$$
+declare ln text;
+begin
+ for ln in
+ explain (analyze, timing off, summary off, costs off)
+ select * from
+ (select ten from tenk1 where ten < 100 order by ten) ss
+ right join (values (1),(2),(3)) v(x) on true
+ loop
+ ln := regexp_replace(ln, 'Memory: \S*', 'Memory: xxx');
+ return next ln;
+ end loop;
+end;
+$$;
+select * from explain_parallel_sort_stats();
+ explain_parallel_sort_stats
+--------------------------------------------------------------------------
+ Nested Loop Left Join (actual rows=30000 loops=1)
+ -> Values Scan on "*VALUES*" (actual rows=3 loops=1)
+ -> Gather Merge (actual rows=10000 loops=3)
+ Workers Planned: 4
+ Workers Launched: 4
+ -> Sort (actual rows=2000 loops=15)
+ Sort Key: tenk1.ten
+ Sort Method: quicksort Memory: xxx
+ Worker 0: Sort Method: quicksort Memory: xxx
+ Worker 1: Sort Method: quicksort Memory: xxx
+ Worker 2: Sort Method: quicksort Memory: xxx
+ Worker 3: Sort Method: quicksort Memory: xxx
+ -> Parallel Seq Scan on tenk1 (actual rows=2000 loops=15)
+ Filter: (ten < 100)
+(14 rows)
+
reset enable_indexscan;
reset enable_hashjoin;
reset enable_mergejoin;
reset enable_material;
reset effective_io_concurrency;
-reset work_mem;
drop table bmscantest;
-- test parallel merge join path.
set enable_hashjoin to off;
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index 1035d04..c1b2d9a 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -187,12 +187,30 @@ explain (analyze, timing off, summary off, costs off)
and tenk2.thousand=0;
alter table tenk2 reset (parallel_workers);
+reset work_mem;
+create function explain_parallel_sort_stats() returns setof text
+language plpgsql as
+$$
+declare ln text;
+begin
+ for ln in
+ explain (analyze, timing off, summary off, costs off)
+ select * from
+ (select ten from tenk1 where ten < 100 order by ten) ss
+ right join (values (1),(2),(3)) v(x) on true
+ loop
+ ln := regexp_replace(ln, 'Memory: \S*', 'Memory: xxx');
+ return next ln;
+ end loop;
+end;
+$$;
+select * from explain_parallel_sort_stats();
+
reset enable_indexscan;
reset enable_hashjoin;
reset enable_mergejoin;
reset enable_material;
reset effective_io_concurrency;
-reset work_mem;
drop table bmscantest;
-- test parallel merge join path.
On Fri, Dec 8, 2017 at 9:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Dec 8, 2017 at 12:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Dec 6, 2017 at 1:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Right and seeing that I have prepared the patch (posted above [1])
which fixes it such that it will resemble the non-parallel case.
Ideally, it would have obviated the need for my previous patch which
got committed as 778e78ae. However, now that is committed, I could
think of below options:
1. I shall rebase it atop what is committed and actually, I have done
that in the attached patch. I have also prepared a regression test
case patch just to show the output with and without the patch.
2. For sort node, we can fix it by having some local_info same as
shared_info in sort node and copy the shared_info in that or we could
reinstate the pointer to the DSM in ExecSortReInitializeDSM() by
looking it up in the TOC as suggested by Thomas. If we go this way,
then we need a similar fix for hash node as well.
Well, the patch you've actually attached makes the bug go away by
removing a net of 53 lines of code. The other approach would probably
add code. So I am tempted to go with the approach you have here. I
would probably change the T_HashState and T_SortState cases in
ExecParallelReInitializeDSM so that they still exist, but just do
something like this:
case T_HashState:
case T_SortState:
/* these nodes have DSM state, but no reinitialization is required */
break;
That way, it will be more clear to future readers of this code that
the lack of a reinitialize function is not an oversight, and the
compiler should optimize these cases away, merging them with the
default case.
Okay, I have adjusted the patch accordingly. I have also added a
regression test which should produce the same result across different
runs, see if that looks okay to you, then it is better to add such a
test as well.
The regression test added by patch needs cleanup at the end which I
have added in the attached patch.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
fix_accum_instr_parallel_workers_v7.patch (application/octet-stream)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 558cb08..64ca8d8 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -819,19 +819,6 @@ ExecParallelReinitialize(PlanState *planstate,
/* Old workers must already be shut down */
Assert(pei->finished);
- /* Clear the instrumentation space from the last round. */
- if (pei->instrumentation)
- {
- Instrumentation *instrument;
- SharedExecutorInstrumentation *sh_instr;
- int i;
-
- sh_instr = pei->instrumentation;
- instrument = GetInstrumentationArray(sh_instr);
- for (i = 0; i < sh_instr->num_workers * sh_instr->num_plan_nodes; ++i)
- InstrInit(&instrument[i], pei->planstate->state->es_instrument);
- }
-
/* Force parameters we're going to pass to workers to be evaluated. */
ExecEvalParamExecParams(sendParams, estate);
@@ -910,12 +897,8 @@ ExecParallelReInitializeDSM(PlanState *planstate,
pcxt);
break;
case T_HashState:
- /* even when not parallel-aware, for EXPLAIN ANALYZE */
- ExecHashReInitializeDSM((HashState *) planstate, pcxt);
- break;
case T_SortState:
- /* even when not parallel-aware, for EXPLAIN ANALYZE */
- ExecSortReInitializeDSM((SortState *) planstate, pcxt);
+ /* these nodes have DSM state, but no reinitialization is required */
break;
default:
@@ -1046,11 +1029,6 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
for (i = 0; i < nworkers; i++)
InstrAccumParallelQuery(&pei->buffer_usage[i]);
- /* Finally, accumulate instrumentation, if any. */
- if (pei->instrumentation)
- ExecParallelRetrieveInstrumentation(pei->planstate,
- pei->instrumentation);
-
pei->finished = true;
}
@@ -1063,6 +1041,11 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
void
ExecParallelCleanup(ParallelExecutorInfo *pei)
{
+ /* Accumulate instrumentation, if any. */
+ if (pei->instrumentation)
+ ExecParallelRetrieveInstrumentation(pei->planstate,
+ pei->instrumentation);
+
/* Free any serialized parameters. */
if (DsaPointerIsValid(pei->param_exec))
{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6fe5d69..afd7384 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1670,19 +1670,6 @@ ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt)
}
/*
- * Reset shared state before beginning a fresh scan.
- */
-void
-ExecHashReInitializeDSM(HashState *node, ParallelContext *pcxt)
-{
- if (node->shared_info != NULL)
- {
- memset(node->shared_info->hinstrument, 0,
- node->shared_info->num_workers * sizeof(HashInstrumentation));
- }
-}
-
-/*
* Locate the DSM space for hash table instrumentation data that we'll write
* to at shutdown time.
*/
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73aa371..d593378 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -397,23 +397,6 @@ ExecSortInitializeDSM(SortState *node, ParallelContext *pcxt)
}
/* ----------------------------------------------------------------
- * ExecSortReInitializeDSM
- *
- * Reset shared state before beginning a fresh scan.
- * ----------------------------------------------------------------
- */
-void
-ExecSortReInitializeDSM(SortState *node, ParallelContext *pcxt)
-{
- /* If there's any instrumentation space, clear it for next time */
- if (node->shared_info != NULL)
- {
- memset(node->shared_info->sinstrument, 0,
- node->shared_info->num_workers * sizeof(TuplesortInstrumentation));
- }
-}
-
-/* ----------------------------------------------------------------
* ExecSortInitializeWorker
*
* Attach worker to DSM space for sort statistics.
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 75d4c70..0974f1e 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -52,7 +52,6 @@ extern int ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);
extern void ExecHashEstimate(HashState *node, ParallelContext *pcxt);
extern void ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt);
extern void ExecHashInitializeWorker(HashState *node, ParallelWorkerContext *pwcxt);
-extern void ExecHashReInitializeDSM(HashState *node, ParallelContext *pcxt);
extern void ExecHashRetrieveInstrumentation(HashState *node);
extern void ExecShutdownHash(HashState *node);
extern void ExecHashGetInstrumentation(HashInstrumentation *instrument,
diff --git a/src/include/executor/nodeSort.h b/src/include/executor/nodeSort.h
index cc61a9d..627a04c 100644
--- a/src/include/executor/nodeSort.h
+++ b/src/include/executor/nodeSort.h
@@ -26,7 +26,6 @@ extern void ExecReScanSort(SortState *node);
/* parallel instrumentation support */
extern void ExecSortEstimate(SortState *node, ParallelContext *pcxt);
extern void ExecSortInitializeDSM(SortState *node, ParallelContext *pcxt);
-extern void ExecSortReInitializeDSM(SortState *node, ParallelContext *pcxt);
extern void ExecSortInitializeWorker(SortState *node, ParallelWorkerContext *pwcxt);
extern void ExecSortRetrieveInstrumentation(SortState *node);
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index ff00d47..dde10b5 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -487,13 +487,49 @@ explain (analyze, timing off, summary off, costs off)
(11 rows)
alter table tenk2 reset (parallel_workers);
+reset work_mem;
+create function explain_parallel_sort_stats() returns setof text
+language plpgsql as
+$$
+declare ln text;
+begin
+ for ln in
+ explain (analyze, timing off, summary off, costs off)
+ select * from
+ (select ten from tenk1 where ten < 100 order by ten) ss
+ right join (values (1),(2),(3)) v(x) on true
+ loop
+ ln := regexp_replace(ln, 'Memory: \S*', 'Memory: xxx');
+ return next ln;
+ end loop;
+end;
+$$;
+select * from explain_parallel_sort_stats();
+ explain_parallel_sort_stats
+--------------------------------------------------------------------------
+ Nested Loop Left Join (actual rows=30000 loops=1)
+ -> Values Scan on "*VALUES*" (actual rows=3 loops=1)
+ -> Gather Merge (actual rows=10000 loops=3)
+ Workers Planned: 4
+ Workers Launched: 4
+ -> Sort (actual rows=2000 loops=15)
+ Sort Key: tenk1.ten
+ Sort Method: quicksort Memory: xxx
+ Worker 0: Sort Method: quicksort Memory: xxx
+ Worker 1: Sort Method: quicksort Memory: xxx
+ Worker 2: Sort Method: quicksort Memory: xxx
+ Worker 3: Sort Method: quicksort Memory: xxx
+ -> Parallel Seq Scan on tenk1 (actual rows=2000 loops=15)
+ Filter: (ten < 100)
+(14 rows)
+
reset enable_indexscan;
reset enable_hashjoin;
reset enable_mergejoin;
reset enable_material;
reset effective_io_concurrency;
-reset work_mem;
drop table bmscantest;
+drop function explain_parallel_sort_stats();
-- test parallel merge join path.
set enable_hashjoin to off;
set enable_nestloop to off;
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index 1035d04..944782c 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -187,13 +187,32 @@ explain (analyze, timing off, summary off, costs off)
and tenk2.thousand=0;
alter table tenk2 reset (parallel_workers);
+reset work_mem;
+create function explain_parallel_sort_stats() returns setof text
+language plpgsql as
+$$
+declare ln text;
+begin
+ for ln in
+ explain (analyze, timing off, summary off, costs off)
+ select * from
+ (select ten from tenk1 where ten < 100 order by ten) ss
+ right join (values (1),(2),(3)) v(x) on true
+ loop
+ ln := regexp_replace(ln, 'Memory: \S*', 'Memory: xxx');
+ return next ln;
+ end loop;
+end;
+$$;
+select * from explain_parallel_sort_stats();
+
reset enable_indexscan;
reset enable_hashjoin;
reset enable_mergejoin;
reset enable_material;
reset effective_io_concurrency;
-reset work_mem;
drop table bmscantest;
+drop function explain_parallel_sort_stats();
-- test parallel merge join path.
set enable_hashjoin to off;
On Fri, Dec 8, 2017 at 5:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
The regression test added by patch needs cleanup at the end which I
have added in the attached patch.
OK, so you've got a test case now, but Thomas independently submitted
a test case patch. Which one is more awesome? :-)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Dec 9, 2017 at 6:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 8, 2017 at 5:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
The regression test added by patch needs cleanup at the end which I
have added in the attached patch.
OK, so you've got a test case now, but Thomas independently submitted
a test case patch. Which one is more awesome? :-)
My test case is for Hash nodes and Amit's is for Sort nodes. Surely
both should be covered?
--
Thomas Munro
http://www.enterprisedb.com
On Fri, Dec 8, 2017 at 2:22 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Sat, Dec 9, 2017 at 6:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 8, 2017 at 5:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
The regression test added by patch needs cleanup at the end which I
have added in the attached patch.
OK, so you've got a test case now, but Thomas independently submitted
a test case patch. Which one is more awesome? :-)
My test case is for Hash nodes and Amit's is for Sort nodes. Surely
both should be covered?
Oh, yeah. I totally missed that detail, sorry.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Dec 8, 2017 at 5:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Okay, I have adjusted the patch accordingly. I have also added a
regression test which should produce the same result across different
runs, see if that looks okay to you, then it is better to add such a
test as well.
The regression test added by patch needs cleanup at the end which I
have added in the attached patch.
Hmm. If we're going this way, then shouldn't we revert the changes
commit 2c09a5c12a66087218c7f8cba269cd3de51b9b82 made to
ExecParallelRetrieveInstrumentation? If that function is only ever
called once, then there's no point doing InstrInit + InstrAggNode, or
checking whether worker_instrument is already initialized. We can
just palloc + memcpy as the code did previously.
I think.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
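For readers who want to see concretely what "just palloc + memcpy" means
here, below is a minimal sketch of the per-worker copy at the end of
ExecParallelRetrieveInstrumentation as it looked before commit 2c09a5c1
(reconstructed from memory, so treat the exact variable names and the
WorkerInstrumentation layout as assumptions):

    /*
     * One-shot copy of the per-worker detail: sufficient once
     * ExecParallelRetrieveInstrumentation runs exactly once, from
     * ExecParallelCleanup, rather than once per rescan.
     */
    ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
    planstate->worker_instrument =
        palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
    planstate->worker_instrument->num_workers = instrumentation->num_workers;
    memcpy(&planstate->worker_instrument->instrument, instrument, ibytes);

Here "instrument" is assumed to already point at this plan node's slice of
the shared instrumentation array, as in the surrounding function.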
On Sat, Dec 9, 2017 at 1:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 8, 2017 at 5:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Okay, I have adjusted the patch accordingly. I have also added a
regression test which should produce the same result across different
runs, see if that looks okay to you, then it is better to add such a
test as well.
The regression test added by patch needs cleanup at the end which I
have added in the attached patch.
Hmm. If we're going this way, then shouldn't we revert the changes
commit 2c09a5c12a66087218c7f8cba269cd3de51b9b82 made to
ExecParallelRetrieveInstrumentation?
Yeah, it is better to revert it as ideally that is not required after
this patch and that is what I have tried to convey above ("Ideally, it
would have obviated the need for my previous patch which
got committed as 778e78ae." (The commit id is for branch 10,
otherwise, it is same as what you mention.)). I have locally reverted
that patch and then rebased it on top of that.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
fix_accum_instr_parallel_workers_v8.patch (application/octet-stream)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index d57cdbd..89d87d5 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -897,12 +897,8 @@ ExecParallelReInitializeDSM(PlanState *planstate,
pcxt);
break;
case T_HashState:
- /* even when not parallel-aware, for EXPLAIN ANALYZE */
- ExecHashReInitializeDSM((HashState *) planstate, pcxt);
- break;
case T_SortState:
- /* even when not parallel-aware, for EXPLAIN ANALYZE */
- ExecSortReInitializeDSM((SortState *) planstate, pcxt);
+ /* these nodes have DSM state, but no reinitialization is required */
break;
default:
@@ -975,7 +971,7 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
/*
* Finish parallel execution. We wait for parallel workers to finish, and
- * accumulate their buffer usage and instrumentation.
+ * accumulate their buffer usage.
*/
void
ExecParallelFinish(ParallelExecutorInfo *pei)
@@ -1021,23 +1017,23 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
for (i = 0; i < nworkers; i++)
InstrAccumParallelQuery(&pei->buffer_usage[i]);
- /* Finally, accumulate instrumentation, if any. */
- if (pei->instrumentation)
- ExecParallelRetrieveInstrumentation(pei->planstate,
- pei->instrumentation);
-
pei->finished = true;
}
/*
* Clean up whatever ParallelExecutorInfo resources still exist after
- * ExecParallelFinish. We separate these routines because someone might
- * want to examine the contents of the DSM after ExecParallelFinish and
- * before calling this routine.
+ * ExecParallelFinish and accumulate instrumentation. We separate these
+ * routines because someone might want to examine the contents of the DSM
+ * after ExecParallelFinish and before calling this routine.
*/
void
ExecParallelCleanup(ParallelExecutorInfo *pei)
{
+ /* Accumulate instrumentation, if any. */
+ if (pei->instrumentation)
+ ExecParallelRetrieveInstrumentation(pei->planstate,
+ pei->instrumentation);
+
/* Free any serialized parameters. */
if (DsaPointerIsValid(pei->param_exec))
{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6fe5d69..afd7384 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1670,19 +1670,6 @@ ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt)
}
/*
- * Reset shared state before beginning a fresh scan.
- */
-void
-ExecHashReInitializeDSM(HashState *node, ParallelContext *pcxt)
-{
- if (node->shared_info != NULL)
- {
- memset(node->shared_info->hinstrument, 0,
- node->shared_info->num_workers * sizeof(HashInstrumentation));
- }
-}
-
-/*
* Locate the DSM space for hash table instrumentation data that we'll write
* to at shutdown time.
*/
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73aa371..d593378 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -397,23 +397,6 @@ ExecSortInitializeDSM(SortState *node, ParallelContext *pcxt)
}
/* ----------------------------------------------------------------
- * ExecSortReInitializeDSM
- *
- * Reset shared state before beginning a fresh scan.
- * ----------------------------------------------------------------
- */
-void
-ExecSortReInitializeDSM(SortState *node, ParallelContext *pcxt)
-{
- /* If there's any instrumentation space, clear it for next time */
- if (node->shared_info != NULL)
- {
- memset(node->shared_info->sinstrument, 0,
- node->shared_info->num_workers * sizeof(TuplesortInstrumentation));
- }
-}
-
-/* ----------------------------------------------------------------
* ExecSortInitializeWorker
*
* Attach worker to DSM space for sort statistics.
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 75d4c70..0974f1e 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -52,7 +52,6 @@ extern int ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);
extern void ExecHashEstimate(HashState *node, ParallelContext *pcxt);
extern void ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt);
extern void ExecHashInitializeWorker(HashState *node, ParallelWorkerContext *pwcxt);
-extern void ExecHashReInitializeDSM(HashState *node, ParallelContext *pcxt);
extern void ExecHashRetrieveInstrumentation(HashState *node);
extern void ExecShutdownHash(HashState *node);
extern void ExecHashGetInstrumentation(HashInstrumentation *instrument,
diff --git a/src/include/executor/nodeSort.h b/src/include/executor/nodeSort.h
index cc61a9d..627a04c 100644
--- a/src/include/executor/nodeSort.h
+++ b/src/include/executor/nodeSort.h
@@ -26,7 +26,6 @@ extern void ExecReScanSort(SortState *node);
/* parallel instrumentation support */
extern void ExecSortEstimate(SortState *node, ParallelContext *pcxt);
extern void ExecSortInitializeDSM(SortState *node, ParallelContext *pcxt);
-extern void ExecSortReInitializeDSM(SortState *node, ParallelContext *pcxt);
extern void ExecSortInitializeWorker(SortState *node, ParallelWorkerContext *pwcxt);
extern void ExecSortRetrieveInstrumentation(SortState *node);
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 86a5592..7824ca5 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -465,14 +465,71 @@ select count(*) from bmscantest where a>1;
99999
(1 row)
+-- test accumulation of stats for parallel nodes
reset enable_seqscan;
+alter table tenk2 set (parallel_workers = 0);
+explain (analyze, timing off, summary off, costs off)
+ select count(*) from tenk1, tenk2 where tenk1.hundred > 1
+ and tenk2.thousand=0;
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+ -> Nested Loop (actual rows=98000 loops=1)
+ -> Seq Scan on tenk2 (actual rows=10 loops=1)
+ Filter: (thousand = 0)
+ Rows Removed by Filter: 9990
+ -> Gather (actual rows=9800 loops=10)
+ Workers Planned: 4
+ Workers Launched: 4
+ -> Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
+ Filter: (hundred > 1)
+ Rows Removed by Filter: 40
+(11 rows)
+
+alter table tenk2 reset (parallel_workers);
+reset work_mem;
+create function explain_parallel_sort_stats() returns setof text
+language plpgsql as
+$$
+declare ln text;
+begin
+ for ln in
+ explain (analyze, timing off, summary off, costs off)
+ select * from
+ (select ten from tenk1 where ten < 100 order by ten) ss
+ right join (values (1),(2),(3)) v(x) on true
+ loop
+ ln := regexp_replace(ln, 'Memory: \S*', 'Memory: xxx');
+ return next ln;
+ end loop;
+end;
+$$;
+select * from explain_parallel_sort_stats();
+ explain_parallel_sort_stats
+--------------------------------------------------------------------------
+ Nested Loop Left Join (actual rows=30000 loops=1)
+ -> Values Scan on "*VALUES*" (actual rows=3 loops=1)
+ -> Gather Merge (actual rows=10000 loops=3)
+ Workers Planned: 4
+ Workers Launched: 4
+ -> Sort (actual rows=2000 loops=15)
+ Sort Key: tenk1.ten
+ Sort Method: quicksort Memory: xxx
+ Worker 0: Sort Method: quicksort Memory: xxx
+ Worker 1: Sort Method: quicksort Memory: xxx
+ Worker 2: Sort Method: quicksort Memory: xxx
+ Worker 3: Sort Method: quicksort Memory: xxx
+ -> Parallel Seq Scan on tenk1 (actual rows=2000 loops=15)
+ Filter: (ten < 100)
+(14 rows)
+
reset enable_indexscan;
reset enable_hashjoin;
reset enable_mergejoin;
reset enable_material;
reset effective_io_concurrency;
-reset work_mem;
drop table bmscantest;
+drop function explain_parallel_sort_stats();
-- test parallel merge join path.
set enable_hashjoin to off;
set enable_nestloop to off;
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index fb35ca3..b12ba0b 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -179,14 +179,40 @@ insert into bmscantest select r, 'fooooooooooooooooooooooooooooooooooooooooooooo
create index i_bmtest ON bmscantest(a);
select count(*) from bmscantest where a>1;
+-- test accumulation of stats for parallel nodes
reset enable_seqscan;
+alter table tenk2 set (parallel_workers = 0);
+explain (analyze, timing off, summary off, costs off)
+ select count(*) from tenk1, tenk2 where tenk1.hundred > 1
+ and tenk2.thousand=0;
+alter table tenk2 reset (parallel_workers);
+
+reset work_mem;
+create function explain_parallel_sort_stats() returns setof text
+language plpgsql as
+$$
+declare ln text;
+begin
+ for ln in
+ explain (analyze, timing off, summary off, costs off)
+ select * from
+ (select ten from tenk1 where ten < 100 order by ten) ss
+ right join (values (1),(2),(3)) v(x) on true
+ loop
+ ln := regexp_replace(ln, 'Memory: \S*', 'Memory: xxx');
+ return next ln;
+ end loop;
+end;
+$$;
+select * from explain_parallel_sort_stats();
+
reset enable_indexscan;
reset enable_hashjoin;
reset enable_mergejoin;
reset enable_material;
reset effective_io_concurrency;
-reset work_mem;
drop table bmscantest;
+drop function explain_parallel_sort_stats();
-- test parallel merge join path.
set enable_hashjoin to off;
On Sat, Dec 9, 2017 at 5:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Dec 9, 2017 at 1:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 8, 2017 at 5:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Okay, I have adjusted the patch accordingly. I have also added a
regression test which should produce the same result across different
runs, see if that looks okay to you, then it is better to add such a
test as well.
The regression test added by patch needs cleanup at the end which I
have added in the attached patch.
Hmm. If we're going this way, then shouldn't we revert the changes
commit 2c09a5c12a66087218c7f8cba269cd3de51b9b82 made to
ExecParallelRetrieveInstrumentation?
Yeah, it is better to revert it as ideally that is not required after
this patch and that is what I have tried to convey above ("Ideally, it
would have obviated the need for my previous patch which
got committed as 778e78ae." (The commit id is for branch 10,
otherwise, it is same as what you mention.)). I have locally reverted
that patch and then rebased it on top of that.
Uh, should I just revert that commit entirely first, and then we can
commit the new fix afterward?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Dec 13, 2017 at 3:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Dec 9, 2017 at 5:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Dec 9, 2017 at 1:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 8, 2017 at 5:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Okay, I have adjusted the patch accordingly. I have also added a
regression test which should produce the same result across different
runs, see if that looks okay to you, then it is better to add such a
test as well.
The regression test added by patch needs cleanup at the end which I
have added in the attached patch.
Hmm. If we're going this way, then shouldn't we revert the changes
commit 2c09a5c12a66087218c7f8cba269cd3de51b9b82 made to
ExecParallelRetrieveInstrumentation?
Yeah, it is better to revert it as ideally that is not required after
this patch and that is what I have tried to convey above ("Ideally, it
would have obviated the need for my previous patch which
got committed as 778e78ae." (The commit id is for branch 10,
otherwise, it is same as what you mention.)). I have locally reverted
that patch and then rebased it on top of that.
Uh, should I just revert that commit entirely first, and then we can
commit the new fix afterward?
Yes. I have already extracted the test case of that commit to the new
patch which is what we need from that commit.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 12, 2017 at 9:37 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Uh, should I just revert that commit entirely first, and then we can
commit the new fix afterward?
Yes. I have already extracted the test case of that commit to the new
patch which is what we need from that commit.
OK, done.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Dec 14, 2017 at 2:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Dec 12, 2017 at 9:37 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Uh, should I just revert that commit entirely first, and then we can
commit the new fix afterward?
Yes. I have already extracted the test case of that commit to the new
patch which is what we need from that commit.
OK, done.
Thanks. I think now we can proceed with
fix_accum_instr_parallel_workers_v8.patch posted above which will fix
the original issue and the problem we have found in sort and hash
nodes.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 13, 2017 at 9:18 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Thanks. I think now we can proceed with
fix_accum_instr_parallel_workers_v8.patch posted above which will fix
the original issue and the problem we have found in sort and hash
nodes.
Committed and back-patched to v10. While I was doing that, I couldn't
help wondering if this code doesn't also need to be moved:
/*
* Next, accumulate buffer usage. (This must wait for the workers to
* finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
InstrAccumParallelQuery(&pei->buffer_usage[i]);
It seems that it doesn't, because the way that instrumentation data
gets accumulated is via InstrStartParallelQuery and
InstrEndParallelQuery, the latter of which clobbers the entry in the
buffer_usage array rather than adding to it:
void
InstrEndParallelQuery(BufferUsage *result)
{
memset(result, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(result, &pgBufferUsage, &save_pgBufferUsage);
}
But we could think about choosing to make that work the same way; that
is, move the code block to ExecParallelCleanup, remove the memset()
call from InstrEndParallelQuery, and change the code that allocates
PARALLEL_KEY_BUFFER_USAGE to initialize the space. That would make
the handling of this more consistent with what we're now doing for the
instrumentation data, although I can't see that it fixes any live bug.
Thoughts?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
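To make that proposal a little more concrete, here is a rough sketch of the
variant being floated; this is not committed code, and the placement inside
ExecParallelCleanup plus the loop bounds are assumptions based on the
surrounding discussion:

    /*
     * InstrEndParallelQuery without the memset, so repeated parallel
     * executions accumulate into the same DSM slot instead of clobbering it.
     */
    void
    InstrEndParallelQuery(BufferUsage *result)
    {
        BufferUsageAccumDiff(result, &pgBufferUsage, &save_pgBufferUsage);
    }

    /*
     * ...and the per-worker accumulation moved from ExecParallelFinish()
     * into ExecParallelCleanup(), next to the instrumentation retrieval.
     */
    for (i = 0; i < pei->pcxt->nworkers_launched; ++i)
        InstrAccumParallelQuery(&pei->buffer_usage[i]);

With that arrangement the space allocated under PARALLEL_KEY_BUFFER_USAGE
would have to be zeroed when it is first set up, since nothing clears it
between loops any more.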
On Tue, Dec 5, 2017 at 4:23 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
The hash version of this code is now committed as 5bcf389e. Here is a
patch for discussion that adds some extra tests to join.sql to
exercise rescans of a hash join under a Gather node. It fails on
head, because it loses track of the instrumentation pointer after the
first loop as you described (since the Hash coding is the same is the
Sort coding), so it finishes up with no instrumentation data. If you
move ExecParallelRetrieveInstrumentation() to ExecParallelCleanup() as
you showed in your patch, then it passes. The way I'm asserting that
instrumentation data is making its way back to the leader is by
turning off leader participation and then checking if it knows how
many batches there were.
In a later email in this thread, you asked me to consider this patch
for commit, but it doesn't apply. I thought that might be the result
of conflicts with Amit's patch which I just committed, but I think
that's not the real explanation, because it touches the 'join'
regression test, not 'select_parallel'. Well, I thought, I'll just
find the place where the SQL should be inserted and stick it in there
-- trivial rebase, right?
Well, not really, because the context surrounding the lines you've
added seems to refer to SQL that I can't find in join.sql or anywhere
else in the tree. So my suspicion is that this patch is based on your
parallel hash patch set rather than master.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Dec 19, 2017 at 1:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Well, not really, because the context surrounding the lines you've
added seems to refer to SQL that I can't find in join.sql or anywhere
else in the tree. So my suspicion is that this patch is based on your
parallel hash patch set rather than master.
Thomas pinged me off-list and showed patch -p1 < $file working fine
for him and ... I'm dumb. I was trying to apply the patch to v10, not
master. It applies fine to master; committed there.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Dec 19, 2017 at 11:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Dec 13, 2017 at 9:18 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Thanks. I think now we can proceed with
fix_accum_instr_parallel_workers_v8.patch posted above which will fix
the original issue and the problem we have found in sort and hash
nodes.
Committed and back-patched to v10. While I was doing that, I couldn't
help wondering if this code doesn't also need to be moved:
/*
* Next, accumulate buffer usage. (This must wait for the workers to
* finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
InstrAccumParallelQuery(&pei->buffer_usage[i]);
It seems that it doesn't, because the way that instrumentation data
gets accumulated is via InstrStartParallelQuery and
InstrEndParallelQuery, the latter of which clobbers the entry in the
buffer_usage array rather than adding to it:
Right, and that is the reason I have not touched it in the patch.
But we could think about choosing to make that work the same way; that
is, move the code block to ExecParallelCleanup, remove the memset()
call from InstrEndParallelQuery, and change the code that allocates
PARALLEL_KEY_BUFFER_USAGE to initialize the space.
Do you mean to say initialize it only the first time it is allocated, or
both the first time and during rescans?
That would make
the handling of this more consistent with what we're now doing for the
instrumentation data, although I can't see that it fixes any live bug.
I think the change you are proposing makes sense to me.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 19, 2017 at 11:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Dec 13, 2017 at 9:18 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Thanks. I think now we can proceed with
fix_accum_instr_parallel_workers_v8.patch posted above which will fix
the original issue and the problem we have found in sort and hash
nodes.
Committed and back-patched to v10.
Thanks and attached find the rebased version that can be applied to
v9.6. I have to change the test case to produce a stable output and
the reason for the change is that 9.6 doesn't have 'summary off'
option for Explain.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
fix_accum_instr_parallel_workers_pgv96.1.patch (application/octet-stream)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 4741aec..6eed6db 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -536,7 +536,7 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
/*
* Finish parallel execution. We wait for parallel workers to finish, and
- * accumulate their buffer usage and instrumentation.
+ * accumulate their buffer usage.
*/
void
ExecParallelFinish(ParallelExecutorInfo *pei)
@@ -553,23 +553,23 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
for (i = 0; i < pei->pcxt->nworkers_launched; ++i)
InstrAccumParallelQuery(&pei->buffer_usage[i]);
- /* Finally, accumulate instrumentation, if any. */
- if (pei->instrumentation)
- ExecParallelRetrieveInstrumentation(pei->planstate,
- pei->instrumentation);
-
pei->finished = true;
}
/*
- * Clean up whatever ParallelExecutorInfo resources still exist after
- * ExecParallelFinish. We separate these routines because someone might
- * want to examine the contents of the DSM after ExecParallelFinish and
- * before calling this routine.
+ * Accumulate instrumentation, and then clean up whatever ParallelExecutorInfo
+ * resources still exist after ExecParallelFinish. We separate these
+ * routines because someone might want to examine the contents of the DSM
+ * after ExecParallelFinish and before calling this routine.
*/
void
ExecParallelCleanup(ParallelExecutorInfo *pei)
{
+ /* Accumulate instrumentation, if any. */
+ if (pei->instrumentation)
+ ExecParallelRetrieveInstrumentation(pei->planstate,
+ pei->instrumentation);
+
if (pei->pcxt != NULL)
{
DestroyParallelContext(pei->pcxt);
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 72dd420..43801e0 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -123,6 +123,50 @@ select avg(aa::int8) from a_star;
13.6538461538461538
(1 row)
+-- test accumulation of stats for parallel nodes
+set enable_indexscan to off;
+set enable_bitmapscan to off;
+set enable_material to off;
+alter table tenk2 set (parallel_workers = 0);
+create function explain_parallel_stats() returns setof text
+language plpgsql as
+$$
+declare ln text;
+begin
+ for ln in
+ explain (analyze, timing off, costs off)
+ select count(*) from tenk1, tenk2 where
+ tenk1.hundred > 1 and tenk2.thousand=0
+ loop
+ ln := regexp_replace(ln, 'Planning time: \S*', 'Planning time: xxx');
+ ln := regexp_replace(ln, 'Execution time: \S*', 'Execution time: xxx');
+ return next ln;
+ end loop;
+end;
+$$;
+select * from explain_parallel_stats();
+ explain_parallel_stats
+--------------------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+ -> Nested Loop (actual rows=98000 loops=1)
+ -> Seq Scan on tenk2 (actual rows=10 loops=1)
+ Filter: (thousand = 0)
+ Rows Removed by Filter: 9990
+ -> Gather (actual rows=9800 loops=10)
+ Workers Planned: 4
+ Workers Launched: 4
+ -> Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
+ Filter: (hundred > 1)
+ Rows Removed by Filter: 40
+ Planning time: xxx ms
+ Execution time: xxx ms
+(13 rows)
+
+reset enable_indexscan;
+reset enable_bitmapscan;
+reset enable_material;
+alter table tenk2 reset (parallel_workers);
+drop function explain_parallel_stats();
-- test the sanity of parallel query after the active role is dropped.
set force_parallel_mode=1;
drop role if exists regress_parallel_worker;
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index af0cc55..7defc34 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -45,6 +45,34 @@ select avg(aa::int8) from a_star;
select avg(aa::int8) from a_star;
+-- test accumulation of stats for parallel nodes
+set enable_indexscan to off;
+set enable_bitmapscan to off;
+set enable_material to off;
+alter table tenk2 set (parallel_workers = 0);
+create function explain_parallel_stats() returns setof text
+language plpgsql as
+$$
+declare ln text;
+begin
+ for ln in
+ explain (analyze, timing off, costs off)
+ select count(*) from tenk1, tenk2 where
+ tenk1.hundred > 1 and tenk2.thousand=0
+ loop
+ ln := regexp_replace(ln, 'Planning time: \S*', 'Planning time: xxx');
+ ln := regexp_replace(ln, 'Execution time: \S*', 'Execution time: xxx');
+ return next ln;
+ end loop;
+end;
+$$;
+select * from explain_parallel_stats();
+reset enable_indexscan;
+reset enable_bitmapscan;
+reset enable_material;
+alter table tenk2 reset (parallel_workers);
+drop function explain_parallel_stats();
+
-- test the sanity of parallel query after the active role is dropped.
set force_parallel_mode=1;
drop role if exists regress_parallel_worker;
On Wed, Dec 20, 2017 at 3:38 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Dec 19, 2017 at 11:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Dec 13, 2017 at 9:18 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Thanks. I think now we can proceed with
fix_accum_instr_parallel_workers_v8.patch posted above which will fix
the original issue and the problem we have found in sort and hash
nodes.
Committed and back-patched to v10.
Thanks and attached find the rebased version that can be applied to
v9.6. I have to change the test case to produce a stable output and
the reason for the change is that 9.6 doesn't have 'summary off'
option for Explain.
This thread got lost in my inbox over Christmas, but I've now
committed this back-port to REL9_6_STABLE.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jan 4, 2018 at 11:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Dec 20, 2017 at 3:38 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Dec 19, 2017 at 11:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Dec 13, 2017 at 9:18 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Thanks. I think now we can proceed with
fix_accum_instr_parallel_workers_v8.patch posted above which will fix
the original issue and the problem we have found in sort and hash
nodes.
Committed and back-patched to v10.
Thanks and attached find the rebased version that can be applied to
v9.6. I have to change the test case to produce a stable output and
the reason for the change is that 9.6 doesn't have 'summary off'
option for Explain.
This thread got lost in my inbox over Christmas, but I've now
committed this back-port to REL9_6_STABLE.
Thanks!
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com