Gather Merge

Started by Rushabh Lathia, over 9 years ago · 69 messages · pgsql-hackers
#1Rushabh Lathia
rushabh.lathia@gmail.com

Hi hackers,

Attached is the patch to implement Gather Merge. The Gather Merge node
assumes that the results from each worker are ordered with respect to each
other, and then does a final merge pass over them, so that we get the
top-level query ordering we want. The final plan for such a query would look
something like this:

 Gather Merge
   ->  Sort
         ->  Parallel Seq Scan on foo
               Filter: something

With this we now have a new parallel node that always returns sorted
results, so that any query having a pathkey can build a Gather Merge path.
Currently, if a query has a pathkey and we want to make it parallel-aware,
the plan would be something like this:

 Sort
   ->  Gather
         ->  Parallel Seq Scan on foo
               Filter: something

With Gather Merge it is now also possible to have a plan like:

 Finalize GroupAggregate
   ->  Gather Merge
         ->  Partial GroupAggregate
               ->  Sort
                     ->  Parallel Seq Scan on foo

With Gather Merge, the sort can be pushed below the Gather Merge node, which
is valuable because it has very good performance benefits. Consider the
following test results:

1) ./db/bin/pgbench postgres -i -F 100 -s 20
2) update pgbench_accounts set filler = 'foo' where aid%10 = 0;
3) vacuum analyze pgbench_accounts;
4) set max_parallel_workers_per_gather = 4;

Without patch:

postgres=# explain analyze select aid, sum(abalance) from pgbench_accounts
where filler like '%foo%' group by aid;
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=81696.51..85242.09 rows=202605 width=12) (actual time=1037.212..1162.086 rows=200000 loops=1)
   Group Key: aid
   ->  Sort  (cost=81696.51..82203.02 rows=202605 width=8) (actual time=1037.203..1072.446 rows=200000 loops=1)
         Sort Key: aid
         Sort Method: external sort  Disk: 3520kB
         ->  Seq Scan on pgbench_accounts  (cost=0.00..61066.59 rows=202605 width=8) (actual time=801.398..868.390 rows=200000 loops=1)
               Filter: (filler ~~ '%foo%'::text)
               Rows Removed by Filter: 1800000
 Planning time: 0.133 ms
 Execution time: 1171.656 ms
(10 rows)

With Patch:

postgres=# explain analyze select aid, sum(abalance) from pgbench_accounts
where filler like '%foo%' group by aid;
                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=47274.13..56644.58 rows=202605 width=12) (actual time=315.457..561.825 rows=200000 loops=1)
   Group Key: aid
   ->  Gather Merge  (cost=47274.13..54365.27 rows=50651 width=0) (actual time=315.451..451.886 rows=200000 loops=1)
         Workers Planned: 4
         Workers Launched: 4
         ->  Partial GroupAggregate  (cost=46274.09..47160.49 rows=50651 width=12) (actual time=306.830..333.908 rows=40000 loops=5)
               Group Key: aid
               ->  Sort  (cost=46274.09..46400.72 rows=50651 width=8) (actual time=306.822..310.800 rows=40000 loops=5)
                     Sort Key: aid
                     Sort Method: quicksort  Memory: 2543kB
                     ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..42316.15 rows=50651 width=8) (actual time=237.552..255.968 rows=40000 loops=5)
                           Filter: (filler ~~ '%foo%'::text)
                           Rows Removed by Filter: 360000
 Planning time: 0.200 ms
 Execution time: 572.221 ms
(15 rows)

I ran the TPCH benchmark queries with the patch and found that 7 out of 22
queries end up picking the Gather Merge path.

The benchmark numbers below were taken under the following configuration:

- Scale factor = 10
- max_worker_processes = DEFAULT (8)
- max_parallel_workers_per_gather = 4
- A cold cache environment was ensured: before every query execution the
  server was stopped and the OS caches were dropped.
- The reported execution times (in ms) are the median of 3 executions.
- power2 machine with 512GB of RAM
- PFA the performance machine CPU info (benchmark_machine_info.txt)

Query 4: With GM 7901.480 -> Without GM 9064.776
Query 5: With GM 53452.126 -> Without GM 55059.511
Query 9: With GM 52613.132 -> Without GM 98206.793
Query 15: With GM 68051.058 -> Without GM 68918.378
Query 17: With GM 129236.075 -> Without GM 160451.094
Query 20: With GM 259144.232 -> Without GM 306256.322
Query 21: With GM 153483.497 -> Without GM 168169.916

From these results we can see that queries 9, 17 and 20 are the ones which
show a good performance benefit with Gather Merge.

PFA tpch_result.tar.gz for the explain analyze output for the TPCH queries
(with and without the patch).

I ran the TPCH benchmark queries with different numbers of workers and found
that query 18 also starts picking Gather Merge with workers > 6. PFA
TPCH_GatherMerge.pdf for the detailed benchmark results.

Implementation details:

New Gather Merge node:

The patch introduces a new node type for Gather Merge. The Gather Merge
implementation is mostly similar to what Gather does. The major difference is
that the Gather node has two TupleTableSlots, one for the leader and one for
the tuple read from a worker, whereas Gather Merge has a TupleTableSlot per
worker, plus one for the leader. Gather Merge needs to fill every slot, then
build a heap of those tuples and return the lowest one.

The patch generates the Gather Merge path from:

a) create_ordered_paths(), for the partial_pathlist. If the pathkeys contain
the sort pathkeys, then directly create the Gather Merge path; if not, create
a Sort first and then create the Gather Merge path on top of it.

Example:

explain analyze
select * from pgbench_accounts where filler like '%foo%' order by aid;

b) create_distinct_paths(): when a sort-based implementation of DISTINCT is
possible.

Example:

explain analyze
select distinct * from pgbench_accounts where filler like '%foo%' order by aid;

c) create_grouping_paths(): while generating a complete GroupAggregate path,
loop over the partial path list and, if a partial path contains the
group_pathkeys, generate the Gather Merge path.

Example:

explain analyze
select * from pgbench_accounts where filler like '%foo%' group by aid;

In all the above cases, the patch gives almost a 2x performance gain. PFA
pgbench_query.out for the explain analyze output for these queries.

Gather Merge reads a tuple from each queue and then picks the lowest one, so
each time it has to read a tuple from a queue in wait mode. During testing I
found that some queries spent noticeable time reading tuples from the
queues. So the patch introduces a tuple array: once we get a tuple in wait
mode, we try to read more tuples in nowait mode and store them in the tuple
array. Once one tuple has come through the queue, there is a good chance
more tuples are already waiting in it, so we just read them, if any, into
the tuple array. With this I found good performance benefits on some of the
complex TPC-H queries.

Costing:

Gather Merge merges several pre-sorted input streams using a heap, so the
costing for Gather Merge is modeled as a combination of cost_gather and
cost_merge_append.

Open Issues:

- Commit af33039317ddc4a0e38a02e2255c2bf453115fd2 fixed a leak in tqueue.c
  by calling gather_readnext() in a per-tuple context. For Gather Merge that
  is not possible as-is, because we store tuples in the tuple array and want
  a tuple to be freed only once it has passed through the merge. One idea is
  to call gm_readnext_tuple() in a per-tuple context (which fixes the leak
  in tqueue.c) and then store a copy of the tuple in the tuple array.
- Need to find a way to add a simple test for this to the regression suite.

Thanks to my colleague Robert Haas for his help with the design and some
preliminary review of the patch.

Please let me know your thoughts, and thanks for reading.

Regards,
Rushabh Lathia
www.EnterpriseDB.com

Attachments:

- TPCH_GatherMerge.pdf (application/pdf)
- tpch_results.tar.gz (application/x-gzip)
- gather_merge_v1.patch (+1293 -4)
- benchmark_machine_info.txt (text/plain)
- pgbench_query.out (application/octet-stream)
#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Rushabh Lathia (#1)
Re: Gather Merge

On Wed, Oct 5, 2016 at 11:35 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

Hi hackers,

Attached is the patch to implement Gather Merge.

Couple of review comments:

1.
ExecGatherMerge()
{
..
+ /* No workers? Then never mind. */
+ if (!got_any_worker ||
+     node->nreaders < 2)
+ {
+     ExecShutdownGatherMergeWorkers(node);
+     node->nreaders = 0;
+ }

Are you planning to restrict the use of Gather Merge based on the number of
workers? If there is a valid reason, then I think the comments should be
updated to explain it.

2.
+ExecGatherMerge(GatherMergeState *node)
+{
..
+ if (!node->initialized)
+ {
+     EState     *estate = node->ps.state;
+     GatherMerge *gm = (GatherMerge *) node->ps.plan;
+
+     /*
+      * Sometimes we might have to run without parallelism; but if parallel
+      * mode is active then we can try to fire up some workers.
+      */
+     if (gm->num_workers > 0 && IsInParallelMode())
+     {
+         ParallelContext *pcxt;
+         bool        got_any_worker = false;
+
+         /* Initialize the workers required to execute Gather node. */
+         if (!node->pei)
+             node->pei = ExecInitParallelPlan(node->ps.lefttree,
+                                              estate,
+                                              gm->num_workers);
..
}

There is a lot of common code between ExecGatherMerge and ExecGather. Do you
think it makes sense to have a common function to avoid the duplication?

I also see small discrepancies between the two; for example, I don't see the
use of the single_copy flag, as it is present in the Gather node.

3.
+gather_merge_readnext(GatherMergeState *gm_state, int reader, bool force)
{
..
+ tup = gm_readnext_tuple(gm_state, reader, force, NULL);
+
+ /*
+  * try to read more tuple into nowait mode and store it into the tuple
+  * array.
+  */
+ if (HeapTupleIsValid(tup))
+     fill_tuple_array(gm_state, reader);

How is the tuple read above stored in the array? In any case, the above
interface seems slightly awkward to me. Basically, I think what you are
trying to do here is: after reading the first tuple in wait mode, you try to
fill the array by reading more tuples. So can't we push the reading of this
first tuple into that function and name it form_tuple_array()?

4.
+create_gather_merge_path(..)
{
..
+  0 /* FIXME: pathnode->limit_tuples*/);

What exactly do you want to fix in the above code?

5.
 +/* Tuple array size */
+#define MAX_TUPLE_STORE 10

Have you tried other values of MAX_TUPLE_STORE? If yes, what were the
results? I think it would be better to add a comment explaining why this
array size is best for performance.

6.
+/* INTERFACE ROUTINES
+ * ExecInitGatherMerge - initialize the MergeAppend node
+ * ExecGatherMerge - retrieve the next tuple from the node
+ * ExecEndGatherMerge - shut down the MergeAppend node
+ * ExecReScanGatherMerge - rescan the MergeAppend node

typo. /MergeAppend/GatherMerge

7.
+static TupleTableSlot *gather_merge_getnext(GatherMergeState * gm_state);
+static HeapTuple gm_readnext_tuple(GatherMergeState * gm_state, int nreader,
+   bool force, bool *done);

The formatting near GatherMergeState doesn't seem appropriate. I think you
need to add GatherMergeState to typedefs.list and then run pgindent again.

8.
+ /*
+  * Initialize funnel slot to same tuple descriptor as outer plan.
+  */
+ if (!ExecContextForcesOids(&gm_state->ps, &hasoid))

I think in the above comment you mean the GatherMerge slot.

9.
+ /* Does tuple array have any avaiable tuples? */
/avaiable/available

Open Issue:

- Commit af33039317ddc4a0e38a02e2255c2bf453115fd2 fixed a leak in tqueue.c
  by calling gather_readnext() in a per-tuple context. For Gather Merge that
  is not possible, as we store tuples in the tuple array and want a tuple to
  be freed only once it has passed through the merge. One idea is to call
  gm_readnext_tuple() in a per-tuple context (which fixes the leak in
  tqueue.c) and then store a copy of the tuple in the tuple array.

Won't the extra copy per tuple impact performance? Was the fix in the
mentioned commit for record or composite types? If so, does Gather Merge
support such types, and if it does, does it provide any benefit over Gather?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#2)
Re: Gather Merge

On Mon, Oct 17, 2016 at 4:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

+ node->nreaders < 2)

...

I see there are small discrepancies in both the codes like I don't see
the use of single_copy flag, as it is present in gather node.

single_copy doesn't make sense for GatherMerge, because the whole
point is to merge a bunch of individually-sorted output streams into a
single stream. If you have only one stream of tuples, you don't need
to merge anything: you could have just used Gather for that.

It does, however, make sense to merge one worker's output with the
leader's output.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#4Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Amit Kapila (#2)
Re: Gather Merge

Thanks Amit for reviewing this patch.

On Mon, Oct 17, 2016 at 2:26 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, Oct 5, 2016 at 11:35 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

Hi hackers,

Attached is the patch to implement Gather Merge.

Couple of review comments:

1.
ExecGatherMerge()
{
..
+ /* No workers? Then never mind. */
+ if (!got_any_worker ||
+     node->nreaders < 2)
+ {
+     ExecShutdownGatherMergeWorkers(node);
+     node->nreaders = 0;
+ }

Are you planning to restrict the use of gather merge based on number
of workers, if there is a valid reason, then I think comments should
be updated for same?

Thanks for catching this. This is left over from an earlier design of the
patch. With the current design we don't have any limitation on the number of
workers. I did the performance testing with max_parallel_workers_per_gather
set to 1 and didn't notice any performance degradation, so I removed this
limitation in the attached patch.

2.

+ExecGatherMerge(GatherMergeState *node)
+{
..
+ if (!node->initialized)
+ {
+     EState     *estate = node->ps.state;
+     GatherMerge *gm = (GatherMerge *) node->ps.plan;
+
+     /*
+      * Sometimes we might have to run without parallelism; but if parallel
+      * mode is active then we can try to fire up some workers.
+      */
+     if (gm->num_workers > 0 && IsInParallelMode())
+     {
+         ParallelContext *pcxt;
+         bool        got_any_worker = false;
+
+         /* Initialize the workers required to execute Gather node. */
+         if (!node->pei)
+             node->pei = ExecInitParallelPlan(node->ps.lefttree,
+                                              estate,
+                                              gm->num_workers);
..
}

There is a lot of common code between ExecGatherMerge and ExecGather. Do you
think it makes sense to have a common function to avoid the duplication?

I also see small discrepancies between the two; for example, I don't see the
use of the single_copy flag, as it is present in the Gather node.

Yes, I also thought about centralizing some parts of ExecGather and
ExecGatherMerge, but it's not really something that is settled yet, and I
thought it might change, particularly for Gather Merge. And as explained by
Robert, single_copy doesn't make sense for Gather Merge. I will still look
into whether something can be centralized.

3.
+gather_merge_readnext(GatherMergeState *gm_state, int reader, bool force)
{
..
+ tup = gm_readnext_tuple(gm_state, reader, force, NULL);
+
+ /*
+  * try to read more tuple into nowait mode and store it into the tuple
+  * array.
+  */
+ if (HeapTupleIsValid(tup))
+     fill_tuple_array(gm_state, reader);

How is the tuple read above stored in the array? In any case, the above
interface seems slightly awkward to me. Basically, I think what you are
trying to do here is: after reading the first tuple in wait mode, you try to
fill the array by reading more tuples. So can't we push the reading of this
first tuple into that function and name it form_tuple_array()?

Yes, you are right. First it tries to read a tuple in wait mode, and once it
finds a tuple, it tries to fill the tuple array (which reads tuples in
nowait mode). The reason I kept them separate is that when initializing the
gather merge, if we are unable to read a tuple from every worker, then while
trying to re-read we again try to fill the tuple array for the workers that
have already produced at least one tuple (see gather_merge_init() for more
details). Also, I thought filling the tuple array (reading tuples in nowait
mode) and reading a tuple (in wait mode) are two separate tasks, and the
code looks clearer with them in separate functions. If you have any
suggestion for the function name (fill_tuple_array) then I am open to
changing it.

4.
+create_gather_merge_path(..)
{
..
+  0 /* FIXME: pathnode->limit_tuples*/);

What exactly you want to fix in above code?

Fixed.

5.
+/* Tuple array size */
+#define MAX_TUPLE_STORE 10

Have you tried with other values of MAX_TUPLE_STORE? If yes, then
what are the results? I think it is better to add a comment why array
size is best for performance.

Actually I was thinking about that, but I didn't want to add it there
because it's just a performance number on my machine. Anyway, I added the
comments.

6.
+/* INTERFACE ROUTINES
+ * ExecInitGatherMerge - initialize the MergeAppend node
+ * ExecGatherMerge - retrieve the next tuple from the node
+ * ExecEndGatherMerge - shut down the MergeAppend node
+ * ExecReScanGatherMerge - rescan the MergeAppend node

typo. /MergeAppend/GatherMerge

Fixed.

7.
+static TupleTableSlot *gather_merge_getnext(GatherMergeState * gm_state);
+static HeapTuple gm_readnext_tuple(GatherMergeState * gm_state, int nreader,
+   bool force, bool *done);

The formatting near GatherMergeState doesn't seem appropriate. I think you
need to add GatherMergeState to typedefs.list and then run pgindent again.

Fixed.

8.
+ /*
+  * Initialize funnel slot to same tuple descriptor as outer plan.
+  */
+ if (!ExecContextForcesOids(&gm_state->ps, &hasoid))

I think in the above comment you mean the GatherMerge slot.

No, it has to be the funnel slot only; it's just a placeholder. For Gather
Merge we initialize one slot per worker, and that is done in
gather_merge_init(). I will look into this point to see if I can get rid of
the funnel slot completely.

9.

+ /* Does tuple array have any avaiable tuples? */
/avaiable/available

Fixed.

Open Issue:

- Commit af33039317ddc4a0e38a02e2255c2bf453115fd2 fixed a leak in tqueue.c
  by calling gather_readnext() in a per-tuple context. For Gather Merge that
  is not possible, as we store tuples in the tuple array and want a tuple to
  be freed only once it has passed through the merge. One idea is to call
  gm_readnext_tuple() in a per-tuple context (which fixes the leak in
  tqueue.c) and then store a copy of the tuple in the tuple array.

Won't the extra copy per tuple impact performance? Was the fix in the
mentioned commit for record or composite types? If so, does Gather Merge
support such types, and if it does, does it provide any benefit over Gather?

I don't think it was specifically for record or composite types, but I might
be wrong. As per my understanding, the commit fixes a leak in tqueue.c. The
fix was to establish the convention that TupleQueueReaderNext() is called in
a shorter-lived memory context, so that tqueue.c doesn't leak memory.

My idea to fix this is to call TupleQueueReaderNext() in a per-tuple
context, then copy the tuple and store it in the tuple array; later, the
next run of ExecStoreTuple() will free the earlier tuple. I will work on
that.


--
Rushabh Lathia

Attachments:

- gather_merge_v2.patch (+1295 -4)
#5Peter Geoghegan
pg@heroku.com
In reply to: Rushabh Lathia (#1)
Re: Gather Merge

On Tue, Oct 4, 2016 at 11:05 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

Query 4: With GM 7901.480 -> Without GM 9064.776
Query 5: With GM 53452.126 -> Without GM 55059.511
Query 9: With GM 52613.132 -> Without GM 98206.793
Query 15: With GM 68051.058 -> Without GM 68918.378
Query 17: With GM 129236.075 -> Without GM 160451.094
Query 20: With GM 259144.232 -> Without GM 306256.322
Query 21: With GM 153483.497 -> Without GM 168169.916

Here from the results we can see that query 9, 17 and 20 are the one which
show good performance benefit with the Gather Merge.

Were all other TPC-H queries unaffected? IOW, did they have the same
plan as before with your patch applied? Did you see any regressions?

I assume that this patch has each worker use work_mem for its own
sort, as with hash joins today. One concern with that model when
testing is that you could end up with a bunch of internal sorts for
cases with a GM node, where you get one big external sort for cases
without one. Did you take that into consideration?

--
Peter Geoghegan


#6Amit Kapila
amit.kapila16@gmail.com
In reply to: Rushabh Lathia (#4)
Re: Gather Merge

On Tue, Oct 18, 2016 at 5:29 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

On Mon, Oct 17, 2016 at 2:26 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

There is lot of common code between ExecGatherMerge and ExecGather.
Do you think it makes sense to have a common function to avoid the
duplicity?

I see there are small discrepancies in both the codes like I don't see
the use of single_copy flag, as it is present in gather node.

Yes, I also thought about centralizing some parts of ExecGather and
ExecGatherMerge, but it's not really something that is settled yet, and I
thought it might change, particularly for Gather Merge. And as explained by
Robert, single_copy doesn't make sense for Gather Merge. I will still look
into whether something can be centralized.

Okay, I haven't thought about it, but do let me know if you can't find any
way to merge the code.

3.
+gather_merge_readnext(GatherMergeState *gm_state, int reader, bool force)
{
..
+ tup = gm_readnext_tuple(gm_state, reader, force, NULL);
+
+ /*
+  * try to read more tuple into nowait mode and store it into the tuple
+  * array.
+  */
+ if (HeapTupleIsValid(tup))
+     fill_tuple_array(gm_state, reader);

How is the tuple read above stored in the array? In any case, the above
interface seems slightly awkward to me. Basically, I think what you are
trying to do here is: after reading the first tuple in wait mode, you try to
fill the array by reading more tuples. So can't we push the reading of this
first tuple into that function and name it form_tuple_array()?

Yes, you are right.

You have not answered my first question. I will try to ask again: how is the
tuple read by the below code stored in the array?

+ tup = gm_readnext_tuple(gm_state, reader, force, NULL);

First it tries to read a tuple in wait mode, and once it finds a tuple, it
tries to fill the tuple array (which reads tuples in nowait mode). The
reason I kept them separate is that when initializing the gather merge, if
we are unable to read a tuple from every worker, then while trying to
re-read we again try to fill the tuple array for the workers that have
already produced at least one tuple (see gather_merge_init() for more
details).

Whenever any worker has produced one tuple, you already try to fill the
array in gather_merge_readnext(), so does the above code help much?

Also, I thought filling the tuple array (reading tuples in nowait mode) and
reading a tuple (in wait mode) are two separate tasks, and the code looks
clearer with them in separate functions.

To me that looks slightly confusing.

If you have any suggestion for the function name (fill_tuple_array) then I
am open to changing it.

form_tuple_array (form_tuple is used in many places in the code, so it
should look okay). I think you might want to consider forming the array even
for the leader; it might not be as beneficial as for the workers, but OTOH
the code will get simplified if we do it that way.

Today, I observed another issue in code:

+gather_merge_init(GatherMergeState *gm_state)
{
..
+reread:
+ for (i = 0; i < nreaders + 1; i++)
+ {
+     if (TupIsNull(gm_state->gm_slots[i]) ||
+         gm_state->gm_slots[i]->tts_isempty)
+     {
+         if (gather_merge_readnext(gm_state, i, initialize ? false : true))
+             binaryheap_add_unordered(gm_state->gm_heap,
+                                      Int32GetDatum(i));
+     }
+     else
+         fill_tuple_array(gm_state, i);
+ }
+ initialize = false;
+
+ for (i = 0; i < nreaders; i++)
+     if (TupIsNull(gm_state->gm_slots[i]) ||
+         gm_state->gm_slots[i]->tts_isempty)
+         goto reread;
..
}

This code can cause an infinite loop. Consider a case where one of the
workers doesn't get any tuple, because by the time it starts, all the tuples
have been consumed by the other workers. The above code will keep looping to
fetch a tuple from that worker, whereas that worker can't return any tuple.
I think you can fix it by checking whether the particular queue has been
exhausted.

Open Issue:

- Commit af33039317ddc4a0e38a02e2255c2bf453115fd2 fixed a leak in tqueue.c
  by calling gather_readnext() in a per-tuple context. For Gather Merge that
  is not possible, as we store tuples in the tuple array and want a tuple to
  be freed only once it has passed through the merge. One idea is to call
  gm_readnext_tuple() in a per-tuple context (which fixes the leak in
  tqueue.c) and then store a copy of the tuple in the tuple array.

Won't the extra copy per tuple impact performance? Was the fix in the
mentioned commit for record or composite types? If so, does Gather Merge
support such types, and if it does, does it provide any benefit over Gather?

I don't think it was specifically for record or composite types, but I might
be wrong. As per my understanding, the commit fixes a leak in tqueue.c.

Hmm, in tqueue.c, I think the memory leak was in the remapping logic; refer
to Tom's mail [1] ("Just to add insult to injury, the backend's memory
consumption bloats to something over 5.5G during that last query").

The fix was to establish the convention that TupleQueueReaderNext() is
called in a shorter-lived memory context, so that tqueue.c doesn't leak
memory.

I have an idea to fix this by calling TupleQueueReaderNext() in a per-tuple
context, then copying the tuple and storing it in the tuple array; later,
the next run of ExecStoreTuple() will free the earlier tuple. I will work on
that.

Okay, if you think that is viable, then you can do it, but do check the
performance impact, because an extra copy per fetched tuple can hurt
performance.

[1]: /messages/by-id/32763.1469821037@sss.pgh.pa.us

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#7Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Peter Geoghegan (#5)
Re: Gather Merge

On Thu, Oct 20, 2016 at 12:22 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Tue, Oct 4, 2016 at 11:05 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

Query 4: With GM 7901.480 -> Without GM 9064.776
Query 5: With GM 53452.126 -> Without GM 55059.511
Query 9: With GM 52613.132 -> Without GM 98206.793
Query 15: With GM 68051.058 -> Without GM 68918.378
Query 17: With GM 129236.075 -> Without GM 160451.094
Query 20: With GM 259144.232 -> Without GM 306256.322
Query 21: With GM 153483.497 -> Without GM 168169.916

Here from the results we can see that queries 9, 17 and 20 are the ones
which show a good performance benefit with Gather Merge.

Were all other TPC-H queries unaffected? IOW, did they have the same
plan as before with your patch applied? Did you see any regressions?

Yes, all other TPC-H queries were unaffected by the patch. At the initial
stage of patch development I noticed regressions, but then realized that it
was because I was not allowing the leader to participate in the GM. Later on
I fixed that, and after that I didn't notice any regressions.

I assume that this patch has each worker use work_mem for its own sort, as
with hash joins today. One concern with that model when testing is that you
could end up with a bunch of internal sorts for cases with a GM node, where
you get one big external sort for cases without one. Did you take that into
consideration?

Yes, but isn't that good? Please correct me if I am missing anything.

--
Peter Geoghegan

--
Rushabh Lathia
www.EnterpriseDB.com

#8Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Amit Kapila (#6)
Re: Gather Merge

On Thu, Oct 20, 2016 at 1:12 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Oct 18, 2016 at 5:29 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

On Mon, Oct 17, 2016 at 2:26 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

There is lot of common code between ExecGatherMerge and ExecGather.
Do you think it makes sense to have a common function to avoid the
duplicity?

I see there are small discrepancies in both the codes like I don't see
the use of single_copy flag, as it is present in gather node.

Yes, I also thought about centralizing some parts of ExecGather and
ExecGatherMerge, but it's not really something that is settled yet, and I
thought it might change, particularly for Gather Merge. And as explained by
Robert, single_copy doesn't make sense for Gather Merge. I will still look
into whether something can be centralized.

Okay, I haven't thought about it, but do let me know if you couldn't
find any way to merge the code.

Sure, I will look into this.

3.
+gather_merge_readnext(GatherMergeState *gm_state, int reader, bool force)
{
..
+ tup = gm_readnext_tuple(gm_state, reader, force, NULL);
+
+ /*
+  * try to read more tuple into nowait mode and store it into the tuple
+  * array.
+  */
+ if (HeapTupleIsValid(tup))
+     fill_tuple_array(gm_state, reader);

How is the tuple read above stored in the array? In any case, the above
interface seems slightly awkward to me. Basically, I think what you are
trying to do here is: after reading the first tuple in wait mode, you try to
fill the array by reading more tuples. So can't we push the reading of this
first tuple into that function and name it form_tuple_array()?

Yes, you are right.

You have not answered my first question. I will try to ask again, how
the tuple read by below code is stored in the array:

The tuple gets stored directly into the related TupleTableSlot. At the end
of gather_merge_readnext() the function builds the TupleTableSlot for the
given tuple, so the tuple read by the above code is stored directly into the
TupleTableSlot.

+ tup = gm_readnext_tuple(gm_state, reader, force, NULL);

First it tries to read a tuple in wait mode, and once it finds a tuple, it
tries to fill the tuple array (which reads tuples in nowait mode). The
reason I kept them separate is that when initializing the gather merge, if
we are unable to read a tuple from every worker, then while trying to
re-read we again try to fill the tuple array for the workers that have
already produced at least one tuple (see gather_merge_init() for more
details).

Whenever any worker produced one tuple, you already try to fill the
array in gather_merge_readnext(), so does the above code help much?

Also, I thought filling the tuple array (reading tuples in nowait mode) and
reading a tuple (in wait mode) are two separate tasks, and the code looks
clearer with them in separate functions.

To me that looks slightly confusing.

If you have any suggestion for the function name (fill_tuple_array),
then I am open to changing it.

form_tuple_array (form_tuple is used at many places in code, so it
should look okay).

Ok, I will rename it in the next patch.

I think you might want to consider forming the array even for the leader;
although it might not be as beneficial as for workers, OTOH the code will
get simplified if we do it that way.

Yes, I did that earlier, and as you guessed it was not beneficial; so, to
avoid the extra memory allocation for the tuple array, I am not forming
the array for the leader.

Today, I observed another issue in code:

+gather_merge_init(GatherMergeState *gm_state)
{
..
+reread:
+    for (i = 0; i < nreaders + 1; i++)
+    {
+        if (TupIsNull(gm_state->gm_slots[i]) ||
+            gm_state->gm_slots[i]->tts_isempty)
+        {
+            if (gather_merge_readnext(gm_state, i, initialize ? false : true))
+            {
+                binaryheap_add_unordered(gm_state->gm_heap,
+                                         Int32GetDatum(i));
+            }
+        }
+        else
+            fill_tuple_array(gm_state, i);
+    }
+    initialize = false;
+
+    for (i = 0; i < nreaders; i++)
+        if (TupIsNull(gm_state->gm_slots[i]) || gm_state->gm_slots[i]->tts_isempty)
+            goto reread;
..
}

This code can cause an infinite loop. Consider a case where one of the
workers doesn't get any tuple because, by the time it starts, all the
tuples have been consumed by the other workers. The above code will keep
looping to fetch a tuple from that worker, whereas that worker can't
return any tuple. I think you can fix it by checking whether the
particular queue has been exhausted.

Oh yes. I will work on the fix and soon submit the next set of patch.
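
A sketch of the kind of fix being suggested (pseudocode; the per-reader flag
queue_done and other details are hypothetical, not the actual patch) could
look like this, excluding exhausted queues from the re-read loop:

```
reread:
    for (i = 0; i < nreaders + 1; i++)
    {
        if (gm_state->queue_done[i])    /* hypothetical per-reader flag */
            continue;                   /* this queue is exhausted */
        if (TupIsNull(gm_state->gm_slots[i]))
        {
            if (gather_merge_readnext(gm_state, i, !initialize))
                binaryheap_add_unordered(gm_state->gm_heap,
                                         Int32GetDatum(i));
        }
        else
            fill_tuple_array(gm_state, i);
    }
    initialize = false;

    /* re-read only for readers that are still live and still empty */
    for (i = 0; i < nreaders + 1; i++)
        if (!gm_state->queue_done[i] && TupIsNull(gm_state->gm_slots[i]))
            goto reread;
```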

Open Issue:

- Commit af33039317ddc4a0e38a02e2255c2bf453115fd2 fixed the leak in
tqueue.c by calling gather_readnext() in a per-tuple context. For gather
merge that is not possible, as we store the tuple into the tuple array and
we want the tuple to be freed only once it has passed through the merge
sort algorithm. One idea is that we can also call gm_readnext_tuple()
under the per-tuple context (which will fix the leak in tqueue.c) and then
store a copy of the tuple into the tuple array.

Won't an extra copy per tuple impact performance? Was the fix in the
mentioned commit for record or composite types? If so, does
GatherMerge support such types, and if it does, does it provide any
benefit over Gather?

I don't think it was specifically for record or composite types, but I
might be wrong. As per my understanding, the commit fixes a leak in
tqueue.c.

Hmm, in tqueue.c, I think the memory leak was in the remapping logic;
refer to Tom's mail [1] ("Just to add insult to injury, the backend's
memory consumption bloats to something over 5.5G during that last query").

The fix was to establish the standard of calling TupleQueueReaderNext()
with a shorter memory context, so that tqueue.c doesn't leak memory.

I have an idea to fix this: call TupleQueueReaderNext() in the per-tuple
context, then copy the tuple and store it in the tuple array; later, the
next run of ExecStoreTuple() will free the earlier tuple. I will work on
that.

Okay, if you think that is viable, then you can do it, but do check
the performance impact of the same, because an extra copy per fetched
tuple can impact performance.

Sure, I will check the performance impact for the same.
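
As a pseudocode sketch of that idea (hypothetical, assuming the
TupleQueueReaderNext() and heap_copytuple() interfaces of the time; this is
not the actual patch):

```
/* Read under the short-lived per-tuple context, so tqueue.c can't leak. */
oldContext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
tup = TupleQueueReaderNext(reader, nowait, &done);
MemoryContextSwitchTo(oldContext);

/* Copy into longer-lived storage before buffering it in the tuple array. */
if (HeapTupleIsValid(tup))
    tuple_array[ntuples++] = heap_copytuple(tup);
```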

[1] - /messages/by-id/32763.1469821037@sss.pgh.pa.us

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Thanks,
Rushabh Lathia
www.EnterpriseDB.com

#9Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Rushabh Lathia (#8)
Re: Gather Merge

Please find attached the latest patch, which fixes the review points as
well as some additional clean-up:

- Get rid of funnel_slot, as it's not needed for Gather Merge
- Renamed fill_tuple_array to form_tuple_array
- Fixed a possible infinite loop in gather_merge_init (reported by Amit
Kapila)
- Fixed the tqueue.c memory leak by calling TupleQueueReaderNext() with a
per-tuple context
- Code cleanup in ExecGatherMerge
- Added a new function, gather_merge_clear_slots(), which clears out all
gather merge slots and also frees the tuple array at the end of execution

I did the performance testing again with the latest patch and haven't
observed any regression. Some of the TPC-H queries show additional benefit
with the latest patch, but it's just under 5%.

Do let me know if I missed anything.


--
Rushabh Lathia

Attachments:

gather_merge_v3.patch (application/x-download, +1323 -4)
#10Thomas Munro
thomas.munro@gmail.com
In reply to: Rushabh Lathia (#9)
Re: Gather Merge

On Thu, Oct 27, 2016 at 10:50 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

Please find attached latest patch which fix the review point as well as
additional clean-up.

I've signed up to review this patch and I'm planning to do some
testing. Here's some initial feedback after a quick read-through:

+ if (gather_merge_readnext(gm_state, i, initialize ? false : true))

Clunky ternary operator... how about "!initialize".

+/*
+ * Function clear out a slot in the tuple table for each gather merge
+ * slots and returns the clear clear slot.
+ */

Maybe better like this? "_Clear_ out a slot in the tuple table for
each gather merge _slot_ and _return_ the _cleared_ slot."

+ /* Free tuple array as we no more need it */

"... as we don't need it any more"

+/*
+ * Read the next tuple for gather merge.
+ *
+ * Function fetch the sorted tuple out of the heap.
+ */

"_Fetch_ the sorted tuple out of the heap."

+ * Otherwise, pull the next tuple from whichever participate we
+ * returned from last time, and reinsert the index into the heap,
+ * because it might now compare differently against the existing

s/participate/participant/

+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California

Shouldn't this say just "(c) 2016, PostgreSQL Global Development
Group"? Are we supposed to be blaming the University of California
for new files?

+#include "executor/tqueue.h"
+#include "miscadmin.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+#include "lib/binaryheap.h"

Not correctly sorted.

+ /*
+ * store the tuple descriptor into gather merge state, so we can use it
+ * later while initilizing the gather merge slots.
+ */

s/initilizing/initializing/

+/* ----------------------------------------------------------------
+ * ExecEndGatherMerge
+ *
+ * frees any storage allocated through C routines.
+ * ----------------------------------------------------------------

The convention in Postgres code seems to be to use a form like "Free
any storage ..." in function documentation. Not sure if that's an
imperative, an infinitive, or if the word "we" is omitted since
English is so fuzzy like that, but it's inconsistent with other
documentation to use "frees" here. Oh, I see that exact wording is in
several other files. I guess I'll just leave this as a complaint
about all those files then :-)

+ * Pull atleast single tuple from each worker + leader and set up the heap.

s/atleast single/at least a single/

+ * Read the tuple for given reader into nowait mode, and form the tuple array.

s/ into / in /

+ * Function attempt to read tuple for the given reader and store it into reader

s/Function attempt /Attempt /

+ * Function returns true if found tuple for the reader, otherwise returns

s/Function returns /Return /

+ * First try to read tuple for each worker (including leader) into nowait
+ * mode, so that we initialize read from each worker as well as leader.

I wonder if it would be good to standardise on the terminology we use
when we mean workers AND the leader. In my Parallel Shared Hash work,
I've been saying 'participants' if I mean = workers + leader. What do
you think?

+ * After this if all active worker unable to produce the tuple, then
+ * re-read and this time read the tuple into wait mode. For the worker,
+ * which was able to produced single tuple in the earlier loop and still
+ * active, just try fill the tuple array if more tuples available.
+ */

How about this? "After this, if all active workers are unable to
produce a tuple, then re-read, and this time use wait mode. For workers
that were able to produce a tuple in the earlier loop and are still
active, just try to fill the tuple array if more tuples are
available."

+ * The heap is never spilled to disk, since we assume N is not very large. So
+ * this is much simple then cost_sort.

s/much simple then/much simpler than/

+ /*
+ * Avoid log(0)...
+ */
+ N = (path->num_workers < 2) ? 2.0 : (double) path->num_workers;
+ logN = LOG2(N);
...
+ /* Per-tuple heap maintenance cost */
+ run_cost += path->path.rows * comparison_cost * 2.0 * logN;

Why multiply by two? The comment above this code says "about log2(N)
comparisons to delete the top heap entry and another log2(N)
comparisons to insert its successor". In fact gather_merge_getnext
calls binaryheap_replace_first, which replaces the top element without
any comparisons at all and then performs a sift-down in log2(N)
comparisons to find its new position. There is no per-tuple "delete"
involved. We "replace" the top element with the value it already had,
just to trigger the sift-down, because we know that our comparator
function might have a new opinion of the sort order of this element.
Very clever! The comment and the 2.0 factor in cost_gather_merge seem
to be wrong though -- or am I misreading the code?

Also, shouldn't we add 1 to N to account for the leader? Suppose
there are 2 workers. There are 3 elements in the binary heap. The
element to be sifted down must be compared against either 1 or 2
others to reorganise the heap. Surely in that case we should estimate
log2(3) = ~1.58 comparisons, not log2(2) = 1 comparison.

I suspect that the leader's contribution will be equivalent to a whole
worker if the plan involves a sort: as soon as the leader pulls a
tuple in gather_merge_init, the sort node will pull all the tuples it
can in a tight loop. It's unfortunate that cost_seqscan has to
estimate what the leader's contribution will be without knowing
whether it has a "greedy" high-startup-cost consumer like a sort or
hash node where the leader will contribute a whole backend's full
attention as soon as it executes the plan, or a lazy consumer where
the leader will probably not contribute much if there are enough
workers to keep it distracted. In the case of a Gather Merge -> Sort
-> Parallel Seq Scan plan, I think we will overestimate the number of
rows (per participant), because cost_seqscan will guess that the
leader is spending 30% of its time per worker servicing the workers,
when in fact it will be sucking tuples into a sort node just as fast
as anyone else. But I don't see what this patch can do about that...

+ * When force is true, function reads the tuple into wait mode. For gather
+ * merge we need to fill the slot from which we returned the earlier tuple, so
+ * this require tuple to be read into wait mode. During initialization phase,
+ * once we try to read the tuple into no-wait mode as we want to initialize all
+ * the readers. Refer gather_merge_init() for more details.
+ *
+ * Function returns true if found tuple for the reader, otherwise returns
+ * false.
+ */
+static bool
+gather_merge_readnext(GatherMergeState *gm_state, int reader, bool force)

s/into wait mode/in wait mode/

This appears throughout the comments; not sure if I can explain this
well but "in wait mode" describes a state of being which is wanted
here, "into wait mode" describes some kind of change or movement or
insertion.

Perhaps it would be better to say "reads the tuple _queue_ in wait
mode", just to make clearer that this is talking about the wait/nowait
feature of tuple queues, and perhaps also note that the leader always
waits since it executes the plan.

Maybe we should use "bool nowait" here anyway, mirroring the TupleQueue
interface? Why introduce another terminology for the same thing with
inverted sense?

+/*
+ * Read the tuple for given reader into nowait mode, and form the tuple array.
+ */
+static void
+form_tuple_array(GatherMergeState *gm_state, int reader)

This function is strangely named. How about try_to_fill_tuple_buffer
or something?

+ GMReaderTuple *gm_tuple = &gm_state->gm_tuple[reader];

I wonder if the purpose of gm_tuple would be clearer if it were called
gm_tuple_buffers, plural because it holds one buffer per reader. Then the
variable on the left hand side could be called tuple_buffer (singular),
because it's the buffer of tuples for one single reader.

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#10)
Re: Gather Merge

On Fri, Nov 4, 2016 at 12:00 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California

Shouldn't this say just "(c) 2016, PostgreSQL Global Development
Group"? Are we supposed to be blaming the University of California
for new files?

If the new file contains a portion of code from that age, yes. If
it's something completely new, using only PGDG is fine. At least
that's what I can conclude by looking at git log -p and searching for
"new file mode".
--
Michael


#12Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#10)
Re: Gather Merge

On Thu, Nov 3, 2016 at 11:00 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

+ /*
+ * Avoid log(0)...
+ */
+ N = (path->num_workers < 2) ? 2.0 : (double) path->num_workers;
+ logN = LOG2(N);
...
+ /* Per-tuple heap maintenance cost */
+ run_cost += path->path.rows * comparison_cost * 2.0 * logN;

Why multiply by two? The comment above this code says "about log2(N)
comparisons to delete the top heap entry and another log2(N)
comparisons to insert its successor". In fact gather_merge_getnext
calls binaryheap_replace_first, which replaces the top element without
any comparisons at all and then performs a sift-down in log2(N)
comparisons to find its new position. There is no per-tuple "delete"
involved. We "replace" the top element with the value it already had,
just to trigger the sift-down, because we know that our comparator
function might have a new opinion of the sort order of this element.
Very clever! The comment and the 2.0 factor in cost_gather_merge seem
to be wrong though -- or am I misreading the code?

See cost_merge_append, and the header comments thereto.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#11)
Re: Gather Merge

Michael Paquier <michael.paquier@gmail.com> writes:

On Fri, Nov 4, 2016 at 12:00 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Shouldn't this say just "(c) 2016, PostgreSQL Global Development
Group"? Are we supposed to be blaming the University of California
for new files?

If the new file contains a portion of code from this age, yes.

My habit has been to include the whole old copyright if there's anything
at all in the new file that could be considered to be copy-and-paste from
an existing file. Frequently it's a gray area.

Legally, I doubt anyone cares much. Morally, I see it as paying due
respect to those who came before us in this project.

regards, tom lane


#14Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#12)
Re: Gather Merge

On Sat, Nov 5, 2016 at 1:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Nov 3, 2016 at 11:00 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

+ /*
+ * Avoid log(0)...
+ */
+ N = (path->num_workers < 2) ? 2.0 : (double) path->num_workers;
+ logN = LOG2(N);
...
+ /* Per-tuple heap maintenance cost */
+ run_cost += path->path.rows * comparison_cost * 2.0 * logN;

Why multiply by two? The comment above this code says "about log2(N)
comparisons to delete the top heap entry and another log2(N)
comparisons to insert its successor". In fact gather_merge_getnext
calls binaryheap_replace_first, which replaces the top element without
any comparisons at all and then performs a sift-down in log2(N)
comparisons to find its new position. There is no per-tuple "delete"
involved. We "replace" the top element with the value it already had,
just to trigger the sift-down, because we know that our comparator
function might have a new opinion of the sort order of this element.
Very clever! The comment and the 2.0 factor in cost_gather_merge seem
to be wrong though -- or am I misreading the code?

See cost_merge_append, and the header comments thereto.

I see. So commit 7a2fe9bd got rid of the delete/insert code
(heap_siftup_slot and heap_insert_slot) and introduced
binaryheap_replace_first which does it in one step, but the costing
wasn't adjusted and still thinks we pay comparison_cost * logN twice.

--
Thomas Munro
http://www.enterprisedb.com


#15Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#13)
Re: Gather Merge

On Sat, Nov 5, 2016 at 2:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Michael Paquier <michael.paquier@gmail.com> writes:

On Fri, Nov 4, 2016 at 12:00 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Shouldn't this say just "(c) 2016, PostgreSQL Global Development
Group"? Are we supposed to be blaming the University of California
for new files?

If the new file contains a portion of code from this age, yes.

My habit has been to include the whole old copyright if there's anything
at all in the new file that could be considered to be copy-and-paste from
an existing file. Frequently it's a gray area.

Thanks. I see that it's warranted in this case, as code is recycled
from MergeAppend.

Legally, I doubt anyone cares much. Morally, I see it as paying due
respect to those who came before us in this project.

+1

--
Thomas Munro
http://www.enterprisedb.com


#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Thomas Munro (#10)
Re: Gather Merge

On Fri, Nov 4, 2016 at 8:30 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Oct 27, 2016 at 10:50 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

Please find attached latest patch which fix the review point as well as
additional clean-up.

+/*
+ * Read the tuple for given reader into nowait mode, and form the tuple array.
+ */
+static void
+form_tuple_array(GatherMergeState *gm_state, int reader)

This function is strangely named. How about try_to_fill_tuple_buffer
or something?

Hmm. We discussed upthread naming it form_tuple_array. Now you feel
that is also not good; I think it is basically a matter of perspective,
so why not leave it as it is for now? We can come back to the naming
towards the end of the patch review, or maybe leave it for the
committer to decide.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#17Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Thomas Munro (#10)
Re: Gather Merge

On Fri, Nov 4, 2016 at 8:30 AM, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:

On Thu, Oct 27, 2016 at 10:50 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

Please find attached latest patch which fix the review point as well as
additional clean-up.

I've signed up to review this patch and I'm planning to do some
testing. Here's some initial feedback after a quick read-through:

Thanks Thomas.

+ if (gather_merge_readnext(gm_state, i, initialize ? false : true))

Clunky ternary operator... how about "!initialize".

Fixed.

+/*
+ * Function clear out a slot in the tuple table for each gather merge
+ * slots and returns the clear clear slot.
+ */

Maybe better like this? "_Clear_ out a slot in the tuple table for
each gather merge _slot_ and _return_ the _cleared_ slot."

Fixed.

+ /* Free tuple array as we no more need it */

"... as we don't need it any more"

Fixed

+/*
+ * Read the next tuple for gather merge.
+ *
+ * Function fetch the sorted tuple out of the heap.
+ */

"_Fetch_ the sorted tuple out of the heap."

Fixed

+ * Otherwise, pull the next tuple from whichever participate we
+ * returned from last time, and reinsert the index into the heap,
+ * because it might now compare differently against the existing

s/participate/participant/

Fixed.

+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California

Shouldn't this say just "(c) 2016, PostgreSQL Global Development
Group"?

Fixed.

Are we supposed to be blaming the University of California
for new files?

Not quite sure about this, so keeping this as it is.

+#include "executor/tqueue.h"
+#include "miscadmin.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+#include "lib/binaryheap.h"

Not correctly sorted.

Copied from nodeGather.c, but fixed here.

+ /*
+ * store the tuple descriptor into gather merge state, so we can use it
+ * later while initilizing the gather merge slots.
+ */

s/initilizing/initializing/

Fixed.

+/* ----------------------------------------------------------------
+ * ExecEndGatherMerge
+ *
+ * frees any storage allocated through C routines.
+ * ----------------------------------------------------------------

The convention in Postgres code seems to be to use a form like "Free
any storage ..." in function documentation. Not sure if that's an
imperative, an infinitive, or if the word "we" is omitted since
English is so fuzzy like that, but it's inconsistent with other
documentation to use "frees" here. Oh, I see that exact wording is in
several other files. I guess I'll just leave this as a complaint
about all those files then :-)

Sure.

+ * Pull atleast single tuple from each worker + leader and set up the
heap.

s/atleast single/at least a single/

Fixed.

+ * Read the tuple for given reader into nowait mode, and form the tuple
array.

s/ into / in /

Fixed.

+ * Function attempt to read tuple for the given reader and store it into
reader

s/Function attempt /Attempt /

Fixed.

+ * Function returns true if found tuple for the reader, otherwise returns

s/Function returns /Return /

Fixed.

+ * First try to read tuple for each worker (including leader) into nowait
+ * mode, so that we initialize read from each worker as well as leader.

I wonder if it would be good to standardise on the terminology we use
when we mean workers AND the leader. In my Parallel Shared Hash work,
I've been saying 'participants' if I mean = workers + leader. What do
you think?

I am not quite sure about "participants". In my opinion, when we explicitly
say workers + leader it's clearer. I am open to changing it if the
committer thinks otherwise.

+ * After this if all active worker unable to produce the tuple, then
+ * re-read and this time read the tuple into wait mode. For the worker,
+ * which was able to produced single tuple in the earlier loop and still
+ * active, just try fill the tuple array if more tuples available.
+ */

How about this? "After this, if all active workers are unable to
produce a tuple, then re-read, and this time use wait mode. For workers
that were able to produce a tuple in the earlier loop and are still
active, just try to fill the tuple array if more tuples are
available."

Fixed.

+ * The heap is never spilled to disk, since we assume N is not very
large. So
+ * this is much simple then cost_sort.

s/much simple then/much simpler than/

Fixed.

+ /*
+ * Avoid log(0)...
+ */
+ N = (path->num_workers < 2) ? 2.0 : (double) path->num_workers;
+ logN = LOG2(N);
...
+ /* Per-tuple heap maintenance cost */
+ run_cost += path->path.rows * comparison_cost * 2.0 * logN;

Why multiply by two? The comment above this code says "about log2(N)
comparisons to delete the top heap entry and another log2(N)
comparisons to insert its successor". In fact gather_merge_getnext
calls binaryheap_replace_first, which replaces the top element without
any comparisons at all and then performs a sift-down in log2(N)
comparisons to find its new position. There is no per-tuple "delete"
involved. We "replace" the top element with the value it already had,
just to trigger the sift-down, because we know that our comparator
function might have a new opinion of the sort order of this element.
Very clever! The comment and the 2.0 factor in cost_gather_merge seem
to be wrong though -- or am I misreading the code?

See cost_merge_append.

Also, shouldn't we add 1 to N to account for the leader? Suppose
there are 2 workers. There are 3 elements in the binary heap. The
element to be sifted down must be compared against either 1 or 2
others to reorganise the heap. Surely in that case we should estimate
log2(3) = ~1.58 comparisons, not log2(2) = 1 comparison.

Yes, good catch. For Gather Merge the leader always participates, so
we should use num_workers + 1.

I suspect that the leader's contribution will be equivalent to a whole
worker if the plan involves a sort: as soon as the leader pulls a
tuple in gather_merge_init, the sort node will pull all the tuples it
can in a tight loop. It's unfortunate that cost_seqscan has to
estimate what the leader's contribution will be without knowing
whether it has a "greedy" high-startup-cost consumer like a sort or
hash node where the leader will contribute a whole backend's full
attention as soon as it executes the plan, or a lazy consumer where
the leader will probably not contribute much if there are enough
workers to keep it distracted. In the case of a Gather Merge -> Sort
-> Parallel Seq Scan plan, I think we will overestimate the number of
rows (per participant), because cost_seqscan will guess that the
leader is spending 30% of its time per worker servicing the workers,
when in fact it will be sucking tuples into a sort node just as fast
as anyone else. But I don't see what this patch can do about that...

Exactly. There is a very thin line when it comes to calculating the cost.
In general, while calculating the cost for GM, I just tried to stay similar
to Gather + MergeAppend.

+ * When force is true, function reads the tuple into wait mode. For gather
+ * merge we need to fill the slot from which we returned the earlier tuple, so
+ * this require tuple to be read into wait mode. During initialization phase,
+ * once we try to read the tuple into no-wait mode as we want to initialize all
+ * the readers. Refer gather_merge_init() for more details.
+ *
+ * Function returns true if found tuple for the reader, otherwise returns
+ * false.
+ */
+static bool
+gather_merge_readnext(GatherMergeState *gm_state, int reader, bool force)

s/into wait mode/in wait mode/

This appears throughout the comments; not sure if I can explain this
well but "in wait mode" describes a state of being which is wanted
here, "into wait mode" describes some kind of change or movement or
insertion.

Perhaps it would be better to say "reads the tuple _queue_ in wait
mode", just to make clearer that this is talking about the wait/nowait
feature of tuple queues, and perhaps also note that the leader always
waits since it executes the plan.

Fixed. I just chose to do s/into wait mode/in wait mode/.

Maybe we should use "bool nowait" here anway, mirroring the TupleQueue

interface? Why introduce another terminology for the same thing with
inverted sense?

Agreed. Changed the gm_readnext_tuple() and gather_merge_readnext()
APIs.

+/*
+ * Read the tuple for given reader into nowait mode, and form the tuple array.
+ */
+static void
+form_tuple_array(GatherMergeState *gm_state, int reader)

This function is strangely named. How about try_to_fill_tuple_buffer
or something?

+ GMReaderTuple *gm_tuple = &gm_state->gm_tuple[reader];

I wonder if the purpose of gm_tuple would be clearer if it were
called gm_tuple_buffers. Plural because it holds one buffer per
reader. Then the variable on the left hand side could be
called tuple_buffer (singular), because it's the buffer of tuples for
one single reader.

Yes, you are right. I renamed the variable as well as the structure.

PFA the latest patch, which addresses the review comments as
well as a few other clean-ups.

Apart from this, my colleague Rafia Sabih reported one regression with
GM: if we set work_mem high enough to accommodate the
sort operation, the GM path gets selected even though Sort
performs much better.

Example:

create table t (i int);
insert into t values(generate_series(1,10000000));
set work_mem =1024000;
explain analyze select * from t order by i;
set enable_gathermerge =off;
explain analyze select * from t order by i;

postgres=# explain analyze select * from t order by i;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Gather Merge (cost=335916.26..648415.76 rows=2499996 width=4) (actual time=2234.145..7628.555 rows=10000000 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Sort (cost=334916.22..341166.21 rows=2499996 width=4) (actual time=2226.609..2611.041 rows=2000000 loops=5)
Sort Key: i
Sort Method: quicksort Memory: 147669kB
-> Parallel Seq Scan on t (cost=0.00..69247.96 rows=2499996 width=4) (actual time=0.034..323.129 rows=2000000 loops=5)
Planning time: 0.061 ms
Execution time: 8143.809 ms
(9 rows)

postgres=# set enable_gathermerge = off;
SET
postgres=# explain analyze select * from t order by i;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Sort (cost=1306920.83..1331920.79 rows=9999985 width=4) (actual time=3521.143..4854.148 rows=10000000 loops=1)
Sort Key: i
Sort Method: quicksort Memory: 854075kB
-> Seq Scan on t (cost=0.00..144247.85 rows=9999985 width=4) (actual time=0.113..1340.758 rows=10000000 loops=1)
Planning time: 0.100 ms
Execution time: 5535.560 ms
(6 rows)

Looking at the plan, I realized that this is happening because of wrong
costing for Gather Merge. In the plan we can see that the row count
estimated by Gather Merge is wrong. This is because the earlier patch had
GM considering rows = subpath->rows, which is not right since the subpath
is a partial path, so we need to multiply it by the number of workers. The
attached patch also fixes this issue. I also ran the TPC-H benchmark with
the patch and the results are the same as earlier.

Thanks,
Rushabh Lathia
www.EnterpriseDB.com

#18Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Rushabh Lathia (#17)
Re: Gather Merge

Oops forgot to attach latest patch in the earlier mail.

On Fri, Nov 11, 2016 at 6:26 PM, Rushabh Lathia <rushabh.lathia@gmail.com>
wrote:

On Fri, Nov 4, 2016 at 8:30 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Thu, Oct 27, 2016 at 10:50 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

Please find attached latest patch which fix the review point as well as
additional clean-up.

I've signed up to review this patch and I'm planning to do some
testing. Here's some initial feedback after a quick read-through:

Thanks Thomas.

+ if (gather_merge_readnext(gm_state, i, initialize ? false : true))

Clunky ternary operator... how about "!initialize".

Fixed.

+/*
+ * Function clear out a slot in the tuple table for each gather merge
+ * slots and returns the clear clear slot.
+ */

Maybe better like this? "_Clear_ out a slot in the tuple table for
each gather merge _slot_ and _return_ the _cleared_ slot."

Fixed.

+ /* Free tuple array as we no more need it */

"... as we don't need it any more"

Fixed

+/*
+ * Read the next tuple for gather merge.
+ *
+ * Function fetch the sorted tuple out of the heap.
+ */

"_Fetch_ the sorted tuple out of the heap."

Fixed

+ * Otherwise, pull the next tuple from whichever participate we
+ * returned from last time, and reinsert the index into the heap,
+ * because it might now compare differently against the existing

s/participate/participant/

Fixed.

+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California

Shouldn't this say just "(c) 2016, PostgreSQL Global Development
Group"?

Fixed.

Are we supposed to be blaming the University of California
for new files?

Not quite sure about this, so keeping this as it is.

+#include "executor/tqueue.h"
+#include "miscadmin.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+#include "lib/binaryheap.h"

Not correctly sorted.

Copied from nodeGather.c, but fixed here.

+ /*
+ * store the tuple descriptor into gather merge state, so we can use it
+ * later while initilizing the gather merge slots.
+ */

s/initilizing/initializing/

Fixed.

+/* ----------------------------------------------------------------
+ * ExecEndGatherMerge
+ *
+ * frees any storage allocated through C routines.
+ * ----------------------------------------------------------------

The convention in Postgres code seems to be to use a form like "Free
any storage ..." in function documentation. Not sure if that's an
imperative, an infinitive, or if the word "we" is omitted since
English is so fuzzy like that, but it's inconsistent with other
documentation to use "frees" here. Oh, I see that exact wording is in
several other files. I guess I'll just leave this as a complaint
about all those files then :-)

Sure.

+ * Pull atleast single tuple from each worker + leader and set up the heap.

s/atleast single/at least a single/

Fixed.

+ * Read the tuple for given reader into nowait mode, and form the tuple array.

s/ into / in /

Fixed.

+ * Function attempt to read tuple for the given reader and store it into reader

s/Function attempt /Attempt /

Fixed.

+ * Function returns true if found tuple for the reader, otherwise returns

s/Function returns /Return /

Fixed.

+ * First try to read tuple for each worker (including leader) into nowait
+ * mode, so that we initialize read from each worker as well as leader.

I wonder if it would be good to standardise on the terminology we use
when we mean workers AND the leader. In my Parallel Shared Hash work,
I've been saying 'participants' if I mean = workers + leader. What do
you think?

I am not quite sure about "participants". In my opinion, when we explicitly
say workers + leader it's clearer. I am open to changing it if the committer
thinks otherwise.

+ * After this if all active worker unable to produce the tuple, then
+ * re-read and this time read the tuple into wait mode. For the worker,
+ * which was able to produced single tuple in the earlier loop and still
+ * active, just try fill the tuple array if more tuples available.
+ */

How about this? "After this, if all active workers are unable to
produce a tuple, then re-read and this time use wait mode. For workers
that were able to produce a tuple in the earlier loop and are still
active, just try to fill the tuple array if more tuples are
available."

Fixed.

+ * The heap is never spilled to disk, since we assume N is not very large. So
+ * this is much simple then cost_sort.

s/much simple then/much simpler than/

Fixed.

+ /*
+ * Avoid log(0)...
+ */
+ N = (path->num_workers < 2) ? 2.0 : (double) path->num_workers;
+ logN = LOG2(N);
...
+ /* Per-tuple heap maintenance cost */
+ run_cost += path->path.rows * comparison_cost * 2.0 * logN;

Why multiply by two? The comment above this code says "about log2(N)
comparisons to delete the top heap entry and another log2(N)
comparisons to insert its successor". In fact gather_merge_getnext
calls binaryheap_replace_first, which replaces the top element without
any comparisons at all and then performs a sift-down in log2(N)
comparisons to find its new position. There is no per-tuple "delete"
involved. We "replace" the top element with the value it already had,
just to trigger the sift-down, because we know that our comparator
function might have a new opinion of the sort order of this element.
Very clever! The comment and the 2.0 factor in cost_gather_merge seem
to be wrong though -- or am I misreading the code?

See cost_merge_append.

Also, shouldn't we add 1 to N to account for the leader? Suppose
there are 2 workers. There are 3 elements in the binary heap. The
element to be sifted down must be compared against either 1 or 2
others to reorganise the heap. Surely in that case we should estimate
log2(3) = ~1.58 comparisons, not log2(2) = 1 comparison.

Yes, good catch. For Gather Merge the leader always participates, so
we should use num_workers + 1.

I suspect that the leader's contribution will be equivalent to a whole
worker if the plan involves a sort: as soon as the leader pulls a
tuple in gather_merge_init, the sort node will pull all the tuples it
can in a tight loop. It's unfortunate that cost_seqscan has to
estimate what the leader's contribution will be without knowing
whether it has a "greedy" high-startup-cost consumer like a sort or
hash node where the leader will contribute a whole backend's full
attention as soon as it executes the plan, or a lazy consumer where
the leader will probably not contribute much if there are enough
workers to keep it distracted. In the case of a Gather Merge -> Sort
-> Parallel Seq Scan plan, I think we will overestimate the number of
rows (per participant), because cost_seqscan will guess that the
leader is spending 30% of its time per worker servicing the workers,
when in fact it will be sucking tuples into a sort node just as fast
as anyone else. But I don't see what this patch can do about that...

Exactly. There is a very thin line when it comes to calculating the cost.
In general, while calculating the cost for GM, I just tried to stay similar
to Gather + MergeAppend.

+ * When force is true, function reads the tuple into wait mode. For gather
+ * merge we need to fill the slot from which we returned the earlier tuple, so
+ * this require tuple to be read into wait mode. During initialization phase,
+ * once we try to read the tuple into no-wait mode as we want to initialize all
+ * the readers. Refer gather_merge_init() for more details.
+ *
+ * Function returns true if found tuple for the reader, otherwise returns
+ * false.
+ */
+static bool
+gather_merge_readnext(GatherMergeState *gm_state, int reader, bool force)

s/into wait mode/in wait mode/

This appears throughout the comments; not sure if I can explain this
well but "in wait mode" describes a state of being which is wanted
here, "into wait mode" describes some kind of change or movement or
insertion.

Perhaps it would be better to say "reads the tuple _queue_ in wait
mode", just to make clearer that this is talking about the wait/nowait
feature of tuple queues, and perhaps also note that the leader always
waits since it executes the plan.

Fixed. I just chose to do s/into wait mode/in wait mode/.

Maybe we should use "bool nowait" here anway, mirroring the TupleQueue

interface? Why introduce another terminology for the same thing with
inverted sense?

Agreed. Changed the gm_readnext_tuple() and gather_merge_readnext()
APIs.

+/*
+ * Read the tuple for given reader into nowait mode, and form the tuple array.
+ */
+static void
+form_tuple_array(GatherMergeState *gm_state, int reader)

This function is strangely named. How about try_to_fill_tuple_buffer
or something?

+ GMReaderTuple *gm_tuple = &gm_state->gm_tuple[reader];

I wonder if the purpose of gm_tuple would be clearer if it were
called gm_tuple_buffers. Plural because it holds one buffer per
reader. Then the variable on the left hand side could be
called tuple_buffer (singular), because it's the buffer of tuples for
one single reader.

Yes, you are right. I renamed the variable as well as the structure.

PFA the latest patch, which addresses the review comments as
well as a few other clean-ups.

Apart from this, my colleague Rafia Sabih reported one regression with
GM: if we set work_mem high enough to accommodate the
sort operation, the GM path gets selected even though Sort
performs much better.

Example:

create table t (i int);
insert into t values(generate_series(1,10000000));
set work_mem =1024000;
explain analyze select * from t order by i;
set enable_gathermerge =off;
explain analyze select * from t order by i;

postgres=# explain analyze select * from t order by i;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Gather Merge (cost=335916.26..648415.76 rows=2499996 width=4) (actual time=2234.145..7628.555 rows=10000000 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Sort (cost=334916.22..341166.21 rows=2499996 width=4) (actual time=2226.609..2611.041 rows=2000000 loops=5)
Sort Key: i
Sort Method: quicksort Memory: 147669kB
-> Parallel Seq Scan on t (cost=0.00..69247.96 rows=2499996 width=4) (actual time=0.034..323.129 rows=2000000 loops=5)
Planning time: 0.061 ms
Execution time: 8143.809 ms
(9 rows)

postgres=# set enable_gathermerge = off;
SET
postgres=# explain analyze select * from t order by i;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Sort (cost=1306920.83..1331920.79 rows=9999985 width=4) (actual time=3521.143..4854.148 rows=10000000 loops=1)
Sort Key: i
Sort Method: quicksort Memory: 854075kB
-> Seq Scan on t (cost=0.00..144247.85 rows=9999985 width=4) (actual time=0.113..1340.758 rows=10000000 loops=1)
Planning time: 0.100 ms
Execution time: 5535.560 ms
(6 rows)

Looking at the plan, I realized that this is happening because of wrong
costing for Gather Merge. In the plan we can see that the row count
estimated by Gather Merge is wrong. This is because the earlier patch had
GM considering rows = subpath->rows, which is not right since the subpath
is a partial path, so we need to multiply it by the number of workers. The
attached patch also fixes this issue. I also ran the TPC-H benchmark with
the patch and the results are the same as earlier.

Thanks,
Rushabh Lathia
www.EnterpriseDB.com

--
Rushabh Lathia

Attachments:

gather_merge_v4.patch (application/x-download, +1334/-4)
#19Thomas Munro
thomas.munro@gmail.com
In reply to: Rushabh Lathia (#17)
Re: Gather Merge

On Sat, Nov 12, 2016 at 1:56 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

On Fri, Nov 4, 2016 at 8:30 AM, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:

+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California

Shouldn't this say just "(c) 2016, PostgreSQL Global Development
Group"?

Fixed.

The year also needs updating to 2016 in nodeGatherMerge.h.

+ /* Per-tuple heap maintenance cost */
+ run_cost += path->path.rows * comparison_cost * 2.0 * logN;

Why multiply by two? The comment above this code says "about log2(N)
comparisons to delete the top heap entry and another log2(N)
comparisons to insert its successor". In fact gather_merge_getnext
calls binaryheap_replace_first, which replaces the top element without
any comparisons at all and then performs a sift-down in log2(N)
comparisons to find its new position. There is no per-tuple "delete"
involved. We "replace" the top element with the value it already had,
just to trigger the sift-down, because we know that our comparator
function might have a new opinion of the sort order of this element.
Very clever! The comment and the 2.0 factor in cost_gather_merge seem
to be wrong though -- or am I misreading the code?

See cost_merge_append.

That just got tweaked in commit 34ca0905.

Looking at the plan I realize that this is happening because wrong costing
for Gather Merge. Here in the plan we can see the row estimated by
Gather Merge is wrong. This is because earlier patch GM was considering
rows = subpath->rows, which is not true as subpath is partial path. So
we need to multiple it with number of worker. Attached patch also fixed
this issues. I also run the TPC-H benchmark with the patch and results
are same as earlier.

In create_grouping_paths:
+ double total_groups = gmpath->rows *
gmpath->parallel_workers;

This hides a variable of the same name in the enclosing scope. Maybe confusing?

In some other places like create_ordered_paths:
+ double total_groups = path->rows * path->parallel_workers;

Though it probably made sense to use this variable name in
create_grouping_paths, wouldn't total_rows be better here?

It feels weird to be working back to a total row count estimate from
the partial one by simply multiplying by path->parallel_workers.
Gather Merge will underestimate the total rows when parallel_workers <
4, if using partial row estimates ultimately from cost_seqscan which
assume some leader contribution. I don't have a better idea though.
Reversing cost_seqscan's logic certainly doesn't seem right. I don't
know how to make them agree on the leader's contribution AND give
principled answers, since there seems to be some kind of cyclic
dependency in the costing logic (cost_seqscan really needs to be given
a leader contribution estimate from its superpath which knows whether
it will allow the leader to pull tuples greedily/fairly or not, but
that superpath hasn't been created yet; cost_gather_merge needs the
row count from its subpath). Or maybe I'm just confused.

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Thomas Munro (#19)
Re: Gather Merge

On Mon, Nov 14, 2016 at 3:51 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Sat, Nov 12, 2016 at 1:56 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

On Fri, Nov 4, 2016 at 8:30 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California

Shouldn't this say just "(c) 2016, PostgreSQL Global Development
Group"?

Fixed.

The year also needs updating to 2016 in nodeGatherMerge.h.

Oops sorry, fixed now.

+ /* Per-tuple heap maintenance cost */
+ run_cost += path->path.rows * comparison_cost * 2.0 * logN;

Why multiply by two? The comment above this code says "about log2(N)
comparisons to delete the top heap entry and another log2(N)
comparisons to insert its successor". In fact gather_merge_getnext
calls binaryheap_replace_first, which replaces the top element without
any comparisons at all and then performs a sift-down in log2(N)
comparisons to find its new position. There is no per-tuple "delete"
involved. We "replace" the top element with the value it already had,
just to trigger the sift-down, because we know that our comparator
function might have a new opinion of the sort order of this element.
Very clever! The comment and the 2.0 factor in cost_gather_merge seem
to be wrong though -- or am I misreading the code?

See cost_merge_append.

That just got tweaked in commit 34ca0905.

Fixed.

Looking at the plan, I realized that this is happening because of wrong
costing for Gather Merge. In the plan we can see that the row count
estimated by Gather Merge is wrong. This is because the earlier patch had
GM considering rows = subpath->rows, which is not right since the subpath
is a partial path, so we need to multiply it by the number of workers. The
attached patch also fixes this issue. I also ran the TPC-H benchmark with
the patch and the results are the same as earlier.

In create_grouping_paths:
+ double total_groups = gmpath->rows *
gmpath->parallel_workers;

This hides a variable of the same name in the enclosing scope. Maybe
confusing?

In some other places like create_ordered_paths:
+ double total_groups = path->rows * path->parallel_workers;

Though it probably made sense to use this variable name in
create_grouping_paths, wouldn't total_rows be better here?

Initially I just copied from the other places. I agree with you that in
create_ordered_paths, total_rows makes more sense.

It feels weird to be working back to a total row count estimate from
the partial one by simply multiplying by path->parallel_workers.
Gather Merge will underestimate the total rows when parallel_workers <
4, if using partial row estimates ultimately from cost_seqscan which
assume some leader contribution. I don't have a better idea though.
Reversing cost_seqscan's logic certainly doesn't seem right. I don't
know how to make them agree on the leader's contribution AND give
principled answers, since there seems to be some kind of cyclic
dependency in the costing logic (cost_seqscan really needs to be given
a leader contribution estimate from its superpath which knows whether
it will allow the leader to pull tuples greedily/fairly or not, but
that superpath hasn't been created yet; cost_gather_merge needs the
row count from its subpath). Or maybe I'm just confused.

Yes, I agree with you, but we can't really make changes to cost_seqscan.
Another option I can think of is to calculate the rows for gather merge
using the reverse of the formula used in cost_seqscan. We could completely
remove the rows argument from create_gather_merge_path() and then, inside
create_gather_merge_path(), calculate total_rows using the same formula
used in cost_seqscan. This is working fine, but I'm not quite sure about
the approach, so I attached that part of the changes as a separate patch.
Any suggestions?

--
Rushabh Lathia
www.EnterpriseDB.com

Attachments:

gather_merge_v4_minor_changes.patch (application/x-download, +1333/-4)
gm_v4_plus_rows_estimate.patch (application/x-download, +19/-13)
#21Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Rushabh Lathia (#20)
#22Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Rushabh Lathia (#21)
#23Robert Haas
robertmhaas@gmail.com
In reply to: Haribabu Kommi (#22)
#24Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Robert Haas (#23)
#25Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Rushabh Lathia (#24)
#26Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Rushabh Lathia (#25)
#27Amit Kapila
amit.kapila16@gmail.com
In reply to: Rushabh Lathia (#25)
In reply to: Rushabh Lathia (#26)
#29Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Amit Kapila (#27)
#30Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Rushabh Lathia (#29)
#31Michael Paquier
michael@paquier.xyz
In reply to: Kuntal Ghosh (#30)
#32Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Michael Paquier (#31)
#33Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Rushabh Lathia (#32)
#34Neha Sharma
neha.sharma@enterprisedb.com
In reply to: Rushabh Lathia (#33)
#35Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Neha Sharma (#34)
#36Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Neha Sharma (#34)
#37Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Rushabh Lathia (#36)
#38Thomas Munro
thomas.munro@gmail.com
In reply to: Rushabh Lathia (#33)
#39Amit Kapila
amit.kapila16@gmail.com
In reply to: Thomas Munro (#38)
#40Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Thomas Munro (#38)
#41Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Amit Kapila (#39)
#42Amit Kapila
amit.kapila16@gmail.com
In reply to: Rushabh Lathia (#41)
#43Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#42)
#44Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#43)
#45Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Amit Kapila (#44)
#46Dilip Kumar
dilipbalaut@gmail.com
In reply to: Rushabh Lathia (#45)
#47Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#46)
#48Robert Haas
robertmhaas@gmail.com
In reply to: Rushabh Lathia (#45)
#49Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Robert Haas (#48)
#50Robert Haas
robertmhaas@gmail.com
In reply to: Rushabh Lathia (#49)
#51Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Robert Haas (#50)
#52Robert Haas
robertmhaas@gmail.com
In reply to: Rushabh Lathia (#51)
#53Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Robert Haas (#52)
#54Robert Haas
robertmhaas@gmail.com
In reply to: Rushabh Lathia (#53)
#55Andreas Joseph Krogh
andreas@visena.com
In reply to: Robert Haas (#54)
#56Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Andreas Joseph Krogh (#55)
#57Andreas Joseph Krogh
andreas@visena.com
In reply to: Rushabh Lathia (#56)
#58Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Andreas Joseph Krogh (#57)
#59Andreas Joseph Krogh
andreas@visena.com
In reply to: Rushabh Lathia (#58)
#60Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Andreas Joseph Krogh (#59)
#61Andreas Joseph Krogh
andreas@visena.com
In reply to: Rushabh Lathia (#60)
#62Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Rushabh Lathia (#60)
#63Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Kuntal Ghosh (#62)
#64Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Rushabh Lathia (#63)
#65Robert Haas
robertmhaas@gmail.com
In reply to: Rushabh Lathia (#64)
#66Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Robert Haas (#65)
#67Robert Haas
robertmhaas@gmail.com
In reply to: Rushabh Lathia (#66)
#68Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#67)
#69Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#68)