[PATCH] Incremental sort (was: PoC: Partial sort)
Hi all!
I decided to start a new thread for this patch for the following two reasons.
* It's renamed from "Partial sort" to "Incremental sort" per a suggestion by
Robert Haas [1]. The new name much better characterizes the essence of the
algorithm.
* I think it's not a PoC anymore. The patch has received several rounds of
review and is now in pretty good shape.
The attached revision of the patch has the following changes.
* According to the review [1], two new path and plan nodes are responsible for
incremental sort: IncSortPath and IncSort, which inherit from SortPath
and Sort respectively. That allowed getting rid of a set of hacks with
minimal code changes.
* According to the review [1] and a comment [2], the previous tuple is stored in
a standalone tuple slot of SortState rather than as a bare HeapTuple.
* A new GUC parameter, enable_incsort, is introduced to control the planner's
ability to choose incremental sort.
* The postgres_fdw test with a cross join that is not pushed down is corrected.
It turned out that with incremental sort such a query is profitable to push down.
I changed the ORDER BY columns so that the index couldn't be used. I think this
solution is more elegant than setting enable_incsort = off.
The patch also contains a set of assorted code and comment improvements.
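For readers new to the thread, here is a minimal Python sketch of the idea (an
illustration only, not code from the patch): when the input is already ordered
by a prefix of the sort keys, incremental sort only has to sort each group of
equal prefix values by the remaining keys, one group at a time, instead of
sorting the whole input at once.

```python
# Illustration only (not code from the patch): the essence of incremental
# sort. The input is already ordered by a prefix of the sort keys (the
# "presorted" / skip columns), so we sort each group of equal prefix
# values by the remaining keys, one group at a time.
from itertools import groupby

def incremental_sort(rows, presorted_cols, remaining_cols):
    """rows must already be sorted by presorted_cols; returns rows fully
    sorted by presorted_cols + remaining_cols."""
    prefix = lambda row: tuple(row[c] for c in presorted_cols)
    suffix = lambda row: tuple(row[c] for c in remaining_cols)
    result = []
    for _, group in groupby(rows, key=prefix):
        # Only one group at a time has to fit in the sort buffer.
        result.extend(sorted(group, key=suffix))
    return result

rows = [(1, 9), (1, 2), (2, 5), (2, 1), (3, 7)]  # presorted by column 0
print(incremental_sort(rows, [0], [1]))
# -> [(1, 2), (1, 9), (2, 1), (2, 5), (3, 7)]
```

This mirrors what the executor does with the tuplesort in the patch: feed
tuples until the skip columns change, perform the sort, emit the group, then
reset the tuplesort for the next group.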
Links
1.
/messages/by-id/CA+TgmoZapyHRm7NVyuyZ+yAV=U1a070BOgRe7PkgyrAegR4JDA@mail.gmail.com
2.
/messages/by-id/CAM3SWZQL4yD2SnDheMCGL0Q2b2oTdKUvv_L6Zg_FcGoLuwMffg@mail.gmail.com
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
incremental-sort-1.patch (application/octet-stream)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index 0b9e3e4..408e14d
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1803,1841 ****
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ---------------------------------------------------------------------
Limit
! Output: t1.c1, t2.c1
-> Sort
! Output: t1.c1, t2.c1
! Sort Key: t1.c1, t2.c1
-> Nested Loop
! Output: t1.c1, t2.c1
-> Foreign Scan on public.ft1 t1
! Output: t1.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-> Materialize
! Output: t2.c1
-> Foreign Scan on public.ft2 t2
! Output: t2.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! c1 | c1
! ----+-----
! 1 | 101
! 1 | 102
! 1 | 103
! 1 | 104
! 1 | 105
! 1 | 106
! 1 | 107
! 1 | 108
! 1 | 109
! 1 | 110
(10 rows)
-- different server, not pushed down. No result expected.
--- 1803,1841 ----
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c2, t2.c2 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c2, t2.c2 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ------------------------------------------------------------------
Limit
! Output: t1.c2, t2.c2
-> Sort
! Output: t1.c2, t2.c2
! Sort Key: t1.c2, t2.c2
-> Nested Loop
! Output: t1.c2, t2.c2
-> Foreign Scan on public.ft1 t1
! Output: t1.c2
! Remote SQL: SELECT c2 FROM "S 1"."T 1"
-> Materialize
! Output: t2.c2
-> Foreign Scan on public.ft2 t2
! Output: t2.c2
! Remote SQL: SELECT c2 FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c2, t2.c2 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c2, t2.c2 OFFSET 100 LIMIT 10;
! c2 | c2
! ----+----
! 0 | 0
! 0 | 0
! 0 | 0
! 0 | 0
! 0 | 0
! 0 | 0
! 0 | 0
! 0 | 0
! 0 | 0
! 0 | 0
(10 rows)
-- different server, not pushed down. No result expected.
*************** select c2/2, sum(c2) * (c2/2) from ft1 g
*** 2377,2394 ****
-- Aggregates in subquery are pushed down.
explain (verbose, costs off)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
! QUERY PLAN
! ----------------------------------------------------------------------------------------------------------
Aggregate
Output: count(ft1.c2), sum(ft1.c2)
! -> Sort
Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
Sort Key: ft1.c2, (sum(ft1.c1))
! -> Foreign Scan
! Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
! Relations: Aggregate on (public.ft1)
! Remote SQL: SELECT c2, sum("C 1"), sqrt("C 1") FROM "S 1"."T 1" GROUP BY c2, (sqrt("C 1"))
! (9 rows)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
count | sum
--- 2377,2397 ----
-- Aggregates in subquery are pushed down.
explain (verbose, costs off)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
! QUERY PLAN
! --------------------------------------------------------------------------------------------------------------------------
Aggregate
Output: count(ft1.c2), sum(ft1.c2)
! -> Incremental Sort
Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
Sort Key: ft1.c2, (sum(ft1.c1))
! Presorted Key: ft1.c2
! -> GroupAggregate
! Output: ft1.c2, sum(ft1.c1), (sqrt((ft1.c1)::double precision))
! Group Key: ft1.c2, sqrt((ft1.c1)::double precision)
! -> Foreign Scan on public.ft1
! Output: ft1.c2, sqrt((ft1.c1)::double precision), ft1.c1
! Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" ORDER BY c2 ASC NULLS LAST, sqrt("C 1") ASC NULLS LAST
! (12 rows)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
count | sum
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index 56b01d0..a9f7111
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 462,469 ****
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 462,469 ----
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c2, t2.c2 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c2, t2.c2 OFFSET 100 LIMIT 10;
! SELECT t1.c2, t2.c2 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c2, t2.c2 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index 95afc2c..049d470
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3524,3529 ****
--- 3524,3543 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-incsort" xreflabel="enable_incsort">
+ <term><varname>enable_incsort</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_incsort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of incremental sort
+ steps. The default is <literal>on</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
<term><varname>enable_indexscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index c9e0a3e..5020f5c
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_grouping_set_keys(PlanS
*** 92,98 ****
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 92,98 ----
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** ExplainNode(PlanState *planstate, List *
*** 974,979 ****
--- 974,982 ----
case T_Sort:
pname = sname = "Sort";
break;
+ case T_IncSort:
+ pname = sname = "Incremental Sort";
+ break;
case T_Group:
pname = sname = "Group";
break;
*************** ExplainNode(PlanState *planstate, List *
*** 1504,1509 ****
--- 1507,1513 ----
planstate, es);
break;
case T_Sort:
+ case T_IncSort:
show_sort_keys(castNode(SortState, planstate), ancestors, es);
show_sort_info(castNode(SortState, planstate), es);
break;
*************** static void
*** 1832,1840 ****
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
--- 1836,1850 ----
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
+ int skipCols;
+
+ if (IsA(plan, IncSort))
+ skipCols = ((IncSort *) plan)->skipCols;
+ else
+ skipCols = 0;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, skipCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
*************** show_merge_append_keys(MergeAppendState
*** 1850,1856 ****
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
--- 1860,1866 ----
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, 0, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1874,1880 ****
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
--- 1884,1890 ----
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
*************** show_grouping_set_keys(PlanState *planst
*** 1930,1936 ****
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
--- 1940,1946 ----
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, 0, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 1987,1993 ****
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
--- 1997,2003 ----
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2000,2012 ****
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
--- 2010,2023 ----
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
+ List *resultPresorted = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2046,2054 ****
--- 2057,2069 ----
nullsFirst[keyno]);
/* Emit one property-list item per sort key */
result = lappend(result, pstrdup(sortkeybuf.data));
+ if (keyno < nPresortedKeys)
+ resultPresorted = lappend(resultPresorted, exprstr);
}
ExplainPropertyList(qlabel, result, es);
+ if (nPresortedKeys > 0)
+ ExplainPropertyList("Presorted Key", resultPresorted, es);
}
/*
*************** show_sort_info(SortState *sortstate, Exp
*** 2195,2206 ****
--- 2210,2230 ----
appendStringInfoSpaces(es->str, es->indent * 2);
appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",
sortMethod, spaceType, spaceUsed);
+ if (sortstate->skipKeys)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort groups: %ld\n",
+ sortstate->groupsCount);
+ }
}
else
{
ExplainPropertyText("Sort Method", sortMethod, es);
ExplainPropertyLong("Sort Space Used", spaceUsed, es);
ExplainPropertyText("Sort Space Type", spaceType, es);
+ if (sortstate->skipKeys)
+ ExplainPropertyLong("Sort Groups",
+ sortstate->groupsCount, es);
}
}
}
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index d380207..c6c3ab7
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
*************** ExecReScan(PlanState *node)
*** 235,240 ****
--- 235,241 ----
break;
case T_SortState:
+ case T_IncSortState:
ExecReScanSort((SortState *) node);
break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 509,516 ****
--- 510,521 ----
case T_CteScan:
case T_Material:
case T_Sort:
+ /* these don't evaluate tlist */
return true;
+ case T_IncSort:
+ return false;
+
case T_LockRows:
case T_Limit:
return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index 0dd95c6..3cc4b77
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
*************** ExecInitNode(Plan *node, EState *estate,
*** 291,296 ****
--- 291,297 ----
break;
case T_Sort:
+ case T_IncSort:
result = (PlanState *) ExecInitSort((Sort *) node,
estate, eflags);
break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index aa08152..aa4d8e2
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 559,564 ****
--- 559,565 ----
sortnode->collations,
sortnode->nullsFirst,
work_mem,
+ false,
false);
}
*************** initialize_aggregate(AggState *aggstate,
*** 637,643 ****
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false);
}
/*
--- 638,644 ----
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false, false);
}
/*
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 591a31a..28272be
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
***************
*** 15,25 ****
--- 15,123 ----
#include "postgres.h"
+ #include "access/htup_details.h"
#include "executor/execdebug.h"
#include "executor/nodeSort.h"
#include "miscadmin.h"
+ #include "utils/lsyscache.h"
#include "utils/tuplesort.h"
+ /*
+ * Check whether the first "skipCols" sort column values of two tuples are equal.
+ */
+ static bool
+ cmpSortSkipCols(SortState *node, TupleTableSlot *a, TupleTableSlot *b)
+ {
+ int n, i;
+
+ Assert(IsA(node->ss.ps.plan, IncSort));
+
+ n = ((IncSort *) node->ss.ps.plan)->skipCols;
+
+ for (i = 0; i < n; i++)
+ {
+ Datum datumA, datumB, result;
+ bool isnullA, isnullB;
+ AttrNumber attno = node->skipKeys[i].attno;
+ SkipKeyData *key;
+
+ datumA = slot_getattr(a, attno, &isnullA);
+ datumB = slot_getattr(b, attno, &isnullB);
+
+ /* Special case for NULL-vs-NULL, else use standard comparison */
+ if (isnullA || isnullB)
+ {
+ if (isnullA == isnullB)
+ continue;
+ else
+ return false;
+ }
+
+ key = &node->skipKeys[i];
+
+ key->fcinfo.arg[0] = datumA;
+ key->fcinfo.arg[1] = datumB;
+
+ /* just for paranoia's sake, we reset isnull each time */
+ key->fcinfo.isnull = false;
+
+ result = FunctionCallInvoke(&key->fcinfo);
+
+ /* Check for null result, since caller is clearly not expecting one */
+ if (key->fcinfo.isnull)
+ elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+ if (!DatumGetBool(result))
+ return false;
+ }
+ return true;
+ }
+
+ /*
+ * Prepare information for skipKeys comparison.
+ */
+ static void
+ prepareSkipCols(SortState *node)
+ {
+ IncSort *plannode;
+ int skipCols,
+ i;
+
+ plannode = (IncSort *) node->ss.ps.plan;
+ Assert(IsA(plannode, IncSort));
+ skipCols = plannode->skipCols;
+
+ node->skipKeys = (SkipKeyData *)palloc(skipCols * sizeof(SkipKeyData));
+
+ for (i = 0; i < skipCols; i++)
+ {
+ Oid equalityOp, equalityFunc;
+ SkipKeyData *key;
+
+ key = &node->skipKeys[i];
+ key->attno = plannode->sort.sortColIdx[i];
+
+ equalityOp = get_equality_op_for_ordering_op(
+ plannode->sort.sortOperators[i], NULL);
+ if (!OidIsValid(equalityOp))
+ elog(ERROR, "missing equality operator for ordering operator %u",
+ plannode->sort.sortOperators[i]);
+
+ equalityFunc = get_opcode(equalityOp);
+ if (!OidIsValid(equalityFunc))
+ elog(ERROR, "missing function for operator %u", equalityOp);
+
+ /* Lookup the comparison function */
+ fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+ /* We can initialize the callinfo just once and re-use it */
+ InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ plannode->sort.collations[i], NULL, NULL);
+ key->fcinfo.argnull[0] = false;
+ key->fcinfo.argnull[1] = false;
+ }
+ }
+
/* ----------------------------------------------------------------
* ExecSort
*************** ExecSort(SortState *node)
*** 42,47 ****
--- 140,155 ----
ScanDirection dir;
Tuplesortstate *tuplesortstate;
TupleTableSlot *slot;
+ Sort *plannode = (Sort *) node->ss.ps.plan;
+ PlanState *outerNode;
+ int skipCols;
+ TupleDesc tupDesc;
+ int64 nTuples = 0;
+
+ if (IsA(plannode, IncSort))
+ skipCols = ((IncSort *) plannode)->skipCols;
+ else
+ skipCols = 0;
/*
* get state info from node
*************** ExecSort(SortState *node)
*** 54,87 ****
tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
/*
* If first time through, read all tuples from outer plan and pass them to
* tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
*/
! if (!node->sort_Done)
! {
! Sort *plannode = (Sort *) node->ss.ps.plan;
! PlanState *outerNode;
! TupleDesc tupDesc;
!
! SO1_printf("ExecSort: %s\n",
! "sorting subplan");
! /*
! * Want to scan subplan in the forward direction while creating the
! * sorted data.
! */
! estate->es_direction = ForwardScanDirection;
! /*
! * Initialize tuplesort module.
! */
! SO1_printf("ExecSort: %s\n",
! "calling tuplesort_begin");
! outerNode = outerPlanState(node);
! tupDesc = ExecGetResultType(outerNode);
tuplesortstate = tuplesort_begin_heap(tupDesc,
plannode->numCols,
plannode->sortColIdx,
--- 162,204 ----
tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
/*
+ * Return next tuple from sorted set if any.
+ */
+ if (node->sort_Done)
+ {
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ if (tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ slot, NULL) || node->finished)
+ return slot;
+ }
+
+ /*
* If first time through, read all tuples from outer plan and pass them to
* tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
*/
! SO1_printf("ExecSort: %s\n",
! "sorting subplan");
! /*
! * Want to scan subplan in the forward direction while creating the
! * sorted data.
! */
! estate->es_direction = ForwardScanDirection;
! /*
! * Initialize tuplesort module.
! */
! SO1_printf("ExecSort: %s\n",
! "calling tuplesort_begin");
! outerNode = outerPlanState(node);
! tupDesc = ExecGetResultType(outerNode);
+ if (skipCols == 0)
+ {
+ /* Regular case: no skip cols */
tuplesortstate = tuplesort_begin_heap(tupDesc,
plannode->numCols,
plannode->sortColIdx,
*************** ExecSort(SortState *node)
*** 89,132 ****
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess);
! if (node->bounded)
! tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
! /*
! * Scan the subplan and feed all the tuples to tuplesort.
! */
! for (;;)
{
! slot = ExecProcNode(outerNode);
if (TupIsNull(slot))
break;
!
tuplesort_puttupleslot(tuplesortstate, slot);
}
! /*
! * Complete the sort.
! */
! tuplesort_performsort(tuplesortstate);
! /*
! * restore to user specified direction
! */
! estate->es_direction = dir;
! /*
! * finally set the sorted flag to true
! */
! node->sort_Done = true;
! node->bounded_Done = node->bounded;
! node->bound_Done = node->bound;
! SO1_printf("ExecSort: %s\n", "sorting done");
}
SO1_printf("ExecSort: %s\n",
"retrieving tuple from tuplesort");
--- 206,355 ----
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess,
! false);
node->tuplesortstate = (void *) tuplesortstate;
! if (node->bounded)
! tuplesort_set_bound(tuplesortstate, node->bound);
! }
! else
! {
! /* Incremental sort case */
! if (node->tuplesortstate == NULL)
! {
! /*
! * We are going to process the first group of presorted data.
! * Initialize support structures for cmpSortSkipCols - already
! * sorted columns.
! */
! prepareSkipCols(node);
! /*
! * Only pass on the remaining columns that are unsorted. Skip
! * abbreviated keys for incremental sort: we are unlikely to have
! * huge groups with incremental sort, so using abbreviated keys
! * would likely be a waste of time.
! */
! tuplesortstate = tuplesort_begin_heap(
! tupDesc,
! plannode->numCols - skipCols,
! &(plannode->sortColIdx[skipCols]),
! &(plannode->sortOperators[skipCols]),
! &(plannode->collations[skipCols]),
! &(plannode->nullsFirst[skipCols]),
! work_mem,
! false,
! true);
! node->tuplesortstate = (void *) tuplesortstate;
! node->groupsCount++;
! }
! else
{
! /* Next group of presorted data */
! tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
! node->groupsCount++;
! }
+ /* Calculate remaining bound for bounded sort */
+ if (node->bounded)
+ tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+ }
+
+ /*
+ * Put next group of tuples where skipCols sort values are equal to
+ * tuplesort.
+ */
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (skipCols == 0)
+ {
+ /* Regular sort case: put all tuples to the tuplesort */
if (TupIsNull(slot))
+ {
+ node->finished = true;
break;
! }
tuplesort_puttupleslot(tuplesortstate, slot);
+ nTuples++;
}
+ else
+ {
+ /* Incremental sort case: put group of presorted data to the tuplesort */
+ if (node->prevSlot->tts_isempty)
+ {
+ /* First tuple */
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+ else
+ {
+ ExecCopySlot(node->prevSlot, slot);
+ }
+ }
+ else
+ {
+ /* Put previous tuple into tuplesort */
+ tuplesort_puttupleslot(tuplesortstate, node->prevSlot);
+ nTuples++;
! if (TupIsNull(slot))
! {
! node->finished = true;
! break;
! }
! else
! {
! bool cmp;
! cmp = cmpSortSkipCols(node, node->prevSlot, slot);
! /* Replace previous tuple with current one */
! ExecCopySlot(node->prevSlot, slot);
! /*
! * When skipCols are not equal then group of presorted data
! * is finished
! */
! if (!cmp)
! break;
! }
! }
! }
}
+ /*
+ * Complete the sort.
+ */
+ tuplesort_performsort(tuplesortstate);
+
+ /*
+ * restore to user specified direction
+ */
+ estate->es_direction = dir;
+
+ /*
+ * finally set the sorted flag to true
+ */
+ node->sort_Done = true;
+ node->bounded_Done = node->bounded;
+
+ /*
+ * Adjust bound_Done with number of tuples we've actually sorted.
+ */
+ if (node->bounded)
+ {
+ if (node->finished)
+ node->bound_Done = node->bound;
+ else
+ node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ }
+
+ SO1_printf("ExecSort: %s\n", "sorting done");
+
SO1_printf("ExecSort: %s\n",
"retrieving tuple from tuplesort");
*************** ExecInitSort(Sort *node, EState *estate,
*** 157,162 ****
--- 380,394 ----
"initializing sort node");
/*
+ * skipCols can't be used with EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+ * or EXEC_FLAG_MARK, because we hold only the current group in
+ * tuplesortstate.
+ */
+ Assert(IsA(node, Sort) || (eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) == 0);
+
+ /*
* create state structure
*/
sortstate = makeNode(SortState);
*************** ExecInitSort(Sort *node, EState *estate,
*** 174,180 ****
--- 406,417 ----
sortstate->bounded = false;
sortstate->sort_Done = false;
+ sortstate->finished = false;
sortstate->tuplesortstate = NULL;
+ sortstate->prevSlot = NULL;
+ sortstate->bound_Done = 0;
+ sortstate->groupsCount = 0;
+ sortstate->skipKeys = NULL;
/*
* Miscellaneous initialization
*************** ExecInitSort(Sort *node, EState *estate,
*** 209,214 ****
--- 446,455 ----
ExecAssignScanTypeFromOuterPlan(&sortstate->ss);
sortstate->ss.ps.ps_ProjInfo = NULL;
+ /* make standalone slot to store previous tuple from outer node */
+ sortstate->prevSlot = MakeSingleTupleTableSlot(
+ ExecGetResultType(outerPlanState(sortstate)));
+
SO1_printf("ExecInitSort: %s\n",
"sort node initialized");
*************** ExecEndSort(SortState *node)
*** 231,236 ****
--- 472,479 ----
ExecClearTuple(node->ss.ss_ScanTupleSlot);
/* must drop pointer to sort result tuple */
ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ /* must drop standalone tuple slot for tuples from outer node */
+ ExecDropSingleTupleTableSlot(node->prevSlot);
/*
* Release tuplesort resources
*************** ExecReScanSort(SortState *node)
*** 318,323 ****
--- 561,567 ----
node->sort_Done = false;
tuplesort_end((Tuplesortstate *) node->tuplesortstate);
node->tuplesortstate = NULL;
+ node->bound_Done = 0;
/*
* if chgParam of subnode is not null then plan will be re-scanned by
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index 05d8538..8c47f44
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 837,842 ****
--- 837,860 ----
/*
+ * CopySortFields
+ *
+ * This function copies the fields of the Sort node. It is used by
+ * all the copy functions for classes which inherit from Sort.
+ */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+ COPY_SCALAR_FIELD(numCols);
+ COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+
+ /*
* _copySort
*/
static Sort *
*************** _copySort(const Sort *from)
*** 847,859 ****
/*
* copy node superclass fields
*/
! CopyPlanFields((const Plan *) from, (Plan *) newnode);
! COPY_SCALAR_FIELD(numCols);
! COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
return newnode;
}
--- 865,893 ----
/*
* copy node superclass fields
*/
! CopySortFields(from, newnode);
! return newnode;
! }
!
!
! /*
! * _copyIncSort
! */
! static IncSort *
! _copyIncSort(const IncSort *from)
! {
! IncSort *newnode = makeNode(IncSort);
!
! /*
! * copy node superclass fields
! */
! CopySortFields((const Sort *) from, (Sort *) newnode);
!
! /*
! * copy remainder of node
! */
! COPY_SCALAR_FIELD(skipCols);
return newnode;
}
*************** copyObject(const void *from)
*** 4583,4588 ****
--- 4617,4625 ----
case T_Sort:
retval = _copySort(from);
break;
+ case T_IncSort:
+ retval = _copyIncSort(from);
+ break;
case T_Group:
retval = _copyGroup(from);
break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index b3802b4..7522cc3
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 781,792 ****
}
static void
! _outSort(StringInfo str, const Sort *node)
{
int i;
- WRITE_NODE_TYPE("SORT");
-
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
--- 781,790 ----
}
static void
! _outSortInfo(StringInfo str, const Sort *node)
{
int i;
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 809,814 ****
--- 807,830 ----
}
static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ WRITE_NODE_TYPE("SORT");
+
+ _outSortInfo(str, node);
+ }
+
+ static void
+ _outIncSort(StringInfo str, const IncSort *node)
+ {
+ WRITE_NODE_TYPE("INCSORT");
+
+ _outSortInfo(str, (const Sort *) node);
+
+ WRITE_INT_FIELD(skipCols);
+ }
+
+ static void
_outUnique(StringInfo str, const Unique *node)
{
int i;
*************** outNode(StringInfo str, const void *obj)
*** 3482,3487 ****
--- 3498,3506 ----
case T_Sort:
_outSort(str, obj);
break;
+ case T_IncSort:
+ _outIncSort(str, obj);
+ break;
case T_Unique:
_outUnique(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index d2f69fe..0dcb86e
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 1978,1989 ****
}
/*
! * _readSort
*/
! static Sort *
! _readSort(void)
{
! READ_LOCALS(Sort);
ReadCommonPlan(&local_node->plan);
--- 1978,1990 ----
}
/*
! * ReadCommonSort
! * Assign the basic stuff of all nodes that inherit from Sort
*/
! static void
! ReadCommonSort(Sort *local_node)
{
! READ_TEMP_LOCALS();
ReadCommonPlan(&local_node->plan);
*************** _readSort(void)
*** 1992,1997 ****
--- 1993,2024 ----
READ_OID_ARRAY(sortOperators, local_node->numCols);
READ_OID_ARRAY(collations, local_node->numCols);
READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+
+ /*
+ * _readSort
+ */
+ static Sort *
+ _readSort(void)
+ {
+ READ_LOCALS_NO_FIELDS(Sort);
+
+ ReadCommonSort(local_node);
+
+ READ_DONE();
+ }
+
+ /*
+ * _readIncSort
+ */
+ static IncSort *
+ _readIncSort(void)
+ {
+ READ_LOCALS(IncSort);
+
+ ReadCommonSort(&local_node->sort);
+
+ READ_INT_FIELD(skipCols);
READ_DONE();
}
*************** parseNodeString(void)
*** 2520,2525 ****
--- 2547,2554 ----
return_value = _readMaterial();
else if (MATCH("SORT", 4))
return_value = _readSort();
+ else if (MATCH("INCSORT", 7))
+ return_value = _readIncSort();
else if (MATCH("GROUP", 5))
return_value = _readGroup();
else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index eeacf81..22f81aa
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3097,3102 ****
--- 3097,3106 ----
ptype = "Sort";
subpath = ((SortPath *) path)->subpath;
break;
+ case T_IncSortPath:
+ ptype = "IncSort";
+ subpath = ((SortPath *) path)->subpath;
+ break;
case T_GroupPath:
ptype = "Group";
subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index d01630f..b98abc7
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
+ bool enable_incsort = true;
bool enable_hashagg = true;
bool enable_nestloop = true;
bool enable_material = true;
*************** cost_recursive_union(Path *runion, Path
*** 1419,1424 ****
--- 1420,1432 ----
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
+ * The sort can be either a full sort of the relation or an incremental sort,
+ * used when the data is already presorted by some prefix of the required
+ * pathkeys. In the latter case we estimate the number of groups the input is
+ * divided into by the presorted pathkeys, and then estimate the cost of
+ * sorting each individual group, assuming the data is divided uniformly among
+ * the groups. Also, if a LIMIT is specified, we only have to pull from the
+ * input and sort a fraction of the total groups.
+ *
* If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path
*** 1445,1451 ****
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
--- 1453,1460 ----
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_startup_cost' is the startup cost for reading the input data
! * 'input_total_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path
*** 1461,1479 ****
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_cost;
! Cost run_cost = 0;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
path->rows = tuples;
--- 1470,1497 ----
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_startup_cost;
! Cost run_cost = 0,
! rest_cost,
! group_cost,
! input_run_cost = input_total_cost - input_startup_cost;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
+ double num_groups,
+ group_input_bytes,
+ group_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
+ if (!enable_incsort)
+ presorted_keys = 0;
path->rows = tuples;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1499,1511 ****
output_bytes = input_bytes;
}
! if (output_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(input_bytes / BLCKSZ);
! double nruns = input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
--- 1517,1566 ----
output_bytes = input_bytes;
}
! /*
! * Estimate the number of groups the dataset is divided into by the
! * presorted keys.
! */
! if (presorted_keys > 0)
! {
! List *presortedExprs = NIL;
! ListCell *l;
! int i = 0;
!
! /* Extract presorted keys as list of expressions */
! foreach(l, pathkeys)
! {
! PathKey *key = (PathKey *)lfirst(l);
! EquivalenceMember *member = (EquivalenceMember *)
! lfirst(list_head(key->pk_eclass->ec_members));
!
! presortedExprs = lappend(presortedExprs, member->em_expr);
!
! i++;
! if (i >= presorted_keys)
! break;
! }
!
! /* Estimate number of groups with equal presorted keys */
! num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! }
! else
! {
! num_groups = 1.0;
! }
!
! /*
! * Estimate the average cost of sorting one group in which the presorted
! * keys are all equal.
! */
! group_input_bytes = input_bytes / num_groups;
! group_tuples = tuples / num_groups;
! if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(group_input_bytes / BLCKSZ);
! double nruns = group_input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1515,1521 ****
*
* Assume about N log2 N comparisons
*/
! startup_cost += comparison_cost * tuples * LOG2(tuples);
/* Disk costs */
--- 1570,1576 ----
*
* Assume about N log2 N comparisons
*/
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
/* Disk costs */
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1526,1535 ****
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! startup_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1581,1590 ----
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! group_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1537,1550 ****
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
}
else
{
/* We'll use plain quicksort on all the input tuples */
! startup_cost += comparison_cost * tuples * LOG2(tuples);
}
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
--- 1592,1617 ----
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
}
else
{
/* We'll use plain quicksort on all the input tuples */
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
}
+ /* Add the per-group cost of fetching tuples from the input */
+ group_cost += input_run_cost / num_groups;
+
+ /*
+ * We have to sort the first group before the node can start returning
+ * tuples. Sorting the remaining groups is required to return all the
+ * other tuples.
+ */
+ startup_cost += group_cost;
+ rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ if (rest_cost > 0.0)
+ run_cost += rest_cost;
+
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2300,2305 ****
--- 2367,2374 ----
cost_sort(&sort_path,
root,
outersortkeys,
+ pathkeys_common(outer_path->pathkeys, outersortkeys),
+ outer_path->startup_cost,
outer_path->total_cost,
outer_path_rows,
outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2326,2331 ****
--- 2395,2402 ----
cost_sort(&sort_path,
root,
innersortkeys,
+ pathkeys_common(inner_path->pathkeys, innersortkeys),
+ inner_path->startup_cost,
inner_path->total_cost,
inner_path_rows,
inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 1065b31..653e4e9
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
#include "nodes/nodeFuncs.h"
#include "nodes/plannodes.h"
#include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/tlist.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
return PATHKEYS_EQUAL;
}
+
+ /*
+ * pathkeys_common
+ * Returns the length of the longest common prefix of keys1 and keys2.
+ */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ int n;
+ ListCell *key1,
+ *key2;
+ n = 0;
+
+ forboth(key1, keys1, key2, keys2)
+ {
+ PathKey *pathkey1 = (PathKey *) lfirst(key1);
+ PathKey *pathkey2 = (PathKey *) lfirst(key2);
+
+ if (pathkey1 != pathkey2)
+ return n;
+ n++;
+ }
+
+ return n;
+ }
+
+
/*
* pathkeys_contained_in
* Common special case of compare_pathkeys: we just want to know
*************** get_cheapest_path_for_pathkeys(List *pat
*** 368,375 ****
/*
* get_cheapest_fractional_path_for_pathkeys
* Find the cheapest path (for retrieving a specified fraction of all
! * the tuples) that satisfies the given pathkeys and parameterization.
! * Return NULL if no such path.
*
* See compare_fractional_path_costs() for the interpretation of the fraction
* parameter.
--- 397,408 ----
/*
* get_cheapest_fractional_path_for_pathkeys
* Find the cheapest path (for retrieving a specified fraction of all
! * the tuples) that satisfies the given parameterization and at least
! * partially satisfies the given pathkeys. Return NULL if no such path.
! * If the pathkeys are only partially satisfied, an incremental sort will
! * be needed to satisfy them completely. Since incremental sort consumes
! * its input one presorted group at a time, it may have to read more data
! * than a fully presorted path would.
*
* See compare_fractional_path_costs() for the interpretation of the fraction
* parameter.
*************** right_merge_direction(PlannerInfo *root,
*** 1461,1486 ****
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
! * no good to order by just the first key(s) of the requested ordering.
! * So the result is always either 0 or list_length(root->query_pathkeys).
*/
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
{
! if (root->query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
{
! /* It's useful ... or at least the first N keys are */
! return list_length(root->query_pathkeys);
}
-
- return 0; /* path ordering not useful */
}
/*
--- 1494,1535 ----
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Returns the number of pathkeys that match the given query_pathkeys.
! * The rest can be satisfied by an incremental sort.
*/
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
{
! int n_common_pathkeys;
!
! if (query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
!
! if (enable_incsort)
{
! /*
! * Return the number of pathkeys in common, or 0 if there are none. Any
! * common prefix of pathkeys is useful for ordering, because incremental
! * sort can provide the remaining ones.
! */
! return n_common_pathkeys;
! }
! else
! {
! /*
! * When incremental sort is disabled, the pathkeys are useful only if
! * they contain all the query pathkeys.
! */
! if (n_common_pathkeys == list_length(query_pathkeys))
! return n_common_pathkeys;
! else
! return 0;
}
}
/*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1496,1502 ****
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
--- 1545,1551 ----
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 997bdcf..f103b04
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 227,233 ****
bool *mergenullsfirst,
Plan *lefttree, Plan *righttree,
JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 227,233 ----
bool *mergenullsfirst,
Plan *lefttree, Plan *righttree,
JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static Plan *prepare_sort_from_pathkeys(
*** 242,251 ****
static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 242,253 ----
static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! int skipCols);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 423,428 ****
--- 425,431 ----
(GatherPath *) best_path);
break;
case T_Sort:
+ case T_IncSort:
plan = (Plan *) create_sort_plan(root,
(SortPath *) best_path,
flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1068,1073 ****
--- 1071,1077 ----
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int n_common_pathkeys;
/* Build the child plan */
/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1102,1110 ****
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
--- 1106,1116 ----
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
+ n_common_pathkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1535,1540 ****
--- 1541,1547 ----
{
Sort *plan;
Plan *subplan;
+ int n_common_pathkeys;
/*
* We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1544,1550 ****
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
--- 1551,1561 ----
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! best_path->subpath->pathkeys);
!
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! n_common_pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1790,1796 ****
sort_plan = (Plan *)
make_sort_from_groupcols(groupClause,
new_grpColIdx,
! subplan);
agg_plan = (Plan *) make_agg(NIL,
NIL,
--- 1801,1808 ----
sort_plan = (Plan *)
make_sort_from_groupcols(groupClause,
new_grpColIdx,
! subplan,
! 0);
agg_plan = (Plan *) make_agg(NIL,
NIL,
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3624,3631 ****
*/
if (best_path->outersortkeys)
{
! Sort *sort = make_sort_from_pathkeys(outer_plan,
! best_path->outersortkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
--- 3636,3649 ----
*/
if (best_path->outersortkeys)
{
! Sort *sort;
! int n_common_pathkeys;
!
! n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! best_path->jpath.outerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3636,3643 ****
if (best_path->innersortkeys)
{
! Sort *sort = make_sort_from_pathkeys(inner_plan,
! best_path->innersortkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
--- 3654,3667 ----
if (best_path->innersortkeys)
{
! Sort *sort;
! int n_common_pathkeys;
!
! n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! best_path->jpath.innerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4692,4698 ****
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
--- 4716,4723 ----
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL, 0,
! lefttree->startup_cost,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5214,5226 ****
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node = makeNode(Sort);
! Plan *plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
--- 5239,5269 ----
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node;
! Plan *plan;
+ /* Always use a regular Sort node when enable_incsort = false */
+ if (!enable_incsort)
+ skipCols = 0;
+
+ if (skipCols == 0)
+ {
+ node = makeNode(Sort);
+ }
+ else
+ {
+ IncSort *incSort;
+
+ incSort = makeNode(IncSort);
+ node = &incSort->sort;
+ incSort->skipCols = skipCols;
+ }
+
+ plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass
*** 5552,5558 ****
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys)
{
int numsortkeys;
AttrNumber *sortColIdx;
--- 5595,5601 ----
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, int skipCols)
{
int numsortkeys;
AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree,
*** 5572,5578 ****
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5615,5621 ----
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5615,5621 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5658,5664 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, 0,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5636,5642 ****
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
--- 5679,5686 ----
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5669,5675 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5713,5719 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** is_projection_capable_plan(Plan *plan)
*** 6317,6322 ****
--- 6361,6367 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncSort:
case T_Unique:
case T_SetOp:
case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index c3fbf3c..5fe1235
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
#include "parser/parse_clause.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
#include "utils/syscache.h"
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index 3d33d46..557f885
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3497,3510 ****
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_partial_path || is_sorted)
{
/* Sort the cheapest partial path, if it isn't already */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3497,3510 ----
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_partial_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest partial path, if it isn't already */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3577,3590 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_path || is_sorted)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3577,3590 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4240,4252 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || is_sorted)
{
! if (!is_sorted)
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
--- 4240,4252 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || n_useful_pathkeys > 0)
{
! if (n_useful_pathkeys < list_length(root->sort_pathkeys))
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid
*** 5325,5332 ****
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL,
! seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
--- 5325,5333 ----
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL, 0,
! seqScanPath->startup_cost, seqScanPath->total_cost,
! rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index be267b9..7835cc4
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 610,615 ****
--- 610,616 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncSort:
case T_Unique:
case T_SetOp:
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 7954c44..4df783e
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2693,2698 ****
--- 2693,2699 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncSort:
case T_Unique:
case T_Gather:
case T_SetOp:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index 06e843d..f3b9717
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 963,969 ****
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 963,970 ----
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, 0,
! sorted_p.startup_cost, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 3248296..2777aca
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
}
/*
! * compare_path_fractional_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
--- 95,101 ----
}
/*
! * compare_fractional_path_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1293,1304 ****
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1293,1305 ----
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ int n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1312,1317 ****
--- 1313,1320 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->parent->tuples,
subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1548,1554 ****
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
--- 1551,1558 ----
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL, 0,
! subpath->startup_cost,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2399,2407 ****
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode = makeNode(SortPath);
- pathnode->path.pathtype = T_Sort;
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
--- 2403,2433 ----
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode;
! int n_common_pathkeys;
!
! if (enable_incsort)
! n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! else
! n_common_pathkeys = 0;
!
! if (n_common_pathkeys == 0)
! {
! pathnode = makeNode(SortPath);
! pathnode->path.pathtype = T_Sort;
! }
! else
! {
! IncSortPath *incpathnode;
!
! incpathnode = makeNode(IncSortPath);
! pathnode = &incpathnode->spath;
! pathnode->path.pathtype = T_IncSort;
! incpathnode->skipCols = n_common_pathkeys;
! }
!
! Assert(n_common_pathkeys < list_length(pathkeys));
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2415,2421 ****
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root, pathkeys,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
--- 2441,2449 ----
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root,
! pathkeys, n_common_pathkeys,
! subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2687,2693 ****
break;
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL,
0.0,
subpath->rows,
subpath->pathtarget->width,
--- 2715,2722 ----
break;
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL, 0,
! 0.0,
0.0,
subpath->rows,
subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index f9f18f2..9607889
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 276,282 ****
qstate->sortOperators,
qstate->sortCollations,
qstate->sortNullsFirsts,
! work_mem, false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
--- 276,282 ----
qstate->sortOperators,
qstate->sortCollations,
qstate->sortNullsFirsts,
! work_mem, false, false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index d14f0f9..a8fd978
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3521,3526 ****
--- 3521,3562 ----
}
/*
+ * estimate_pathkeys_groups - Estimate the number of groups the dataset is
+ * divided into by pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * the first i pathkeys divide the dataset into. This is actually a convenience
+ * wrapper over estimate_num_groups().
+ */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ ListCell *l;
+ List *groupExprs = NIL;
+ double *result;
+ int i;
+
+ /*
+ * Get number of groups for each prefix of pathkeys.
+ */
+ i = 0;
+ result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ groupExprs = lappend(groupExprs, member->em_expr);
+
+ result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ i++;
+ }
+
+ return result;
+ }
+
+ /*
* Estimate hash bucketsize fraction (ie, number of entries in a bucket
* divided by total tuples in relation) if the specified expression is used
* as a hash key.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 5d8fb2e..46a2c16
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 857,862 ****
--- 857,871 ----
NULL, NULL, NULL
},
{
+ {"enable_incsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of incremental sort steps."),
+ NULL
+ },
+ &enable_incsort,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of hashed aggregation plans."),
NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index e1e692d..af93ae4
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 281,286 ****
--- 281,291 ----
int64 allowedMem; /* total memory allowed, in bytes */
int maxTapes; /* number of tapes (Knuth's T) */
int tapeRange; /* maxTapes-1 (Knuth's P) */
+ TupSortStatus maxStatus; /* maximum status reached between sort groups */
+ int64 maxMem; /* maximum amount of memory used between
+ sort groups */
+ bool maxMemOnDisk; /* is maxMem value for on-disk memory */
+ MemoryContext maincontext;
MemoryContext sortcontext; /* memory context holding most sort data */
MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 633,638 ****
--- 638,646 ----
static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+
/*
* Special versions of qsort just for SortTuple objects. qsort_tuple() sorts
*************** static Tuplesortstate *
*** 667,685 ****
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Create a working memory context for this sort operation. All data
! * needed by the sort will live inside this context.
*/
! sortcontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
--- 675,704 ----
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
+ MemoryContext maincontext;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Memory context surviving tuplesort_reset. This memory context holds
! * data which is useful to keep while sorting multiple similar batches.
*/
! maincontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
+ * Create a working memory context for one sort operation. The content of
+ * this context is deleted by tuplesort_reset.
+ */
+ sortcontext = AllocSetContextCreate(maincontext,
+ "TupleSort sort",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+
+ /*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 696,702 ****
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(sortcontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
--- 715,721 ----
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(maincontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
*************** tuplesort_begin_common(int workMem, bool
*** 714,719 ****
--- 733,739 ----
state->availMem = state->allowedMem;
state->sortcontext = sortcontext;
state->tuplecontext = tuplecontext;
+ state->maincontext = maincontext;
state->tapeset = NULL;
state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 754,766 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
AssertArg(nkeys > 0);
--- 774,787 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
AssertArg(nkeys > 0);
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 802,808 ****
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0);
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
--- 823,829 ----
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0) && !skipAbbrev;
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 833,839 ****
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 854,860 ----
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 924,930 ****
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 945,951 ----
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 997,1003 ****
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 1018,1024 ----
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 1034,1040 ****
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 1055,1061 ----
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1145,1160 ****
}
/*
! * tuplesort_end
! *
! * Release resources and clean up.
*
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
*/
! void
! tuplesort_end(Tuplesortstate *state)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1166,1177 ----
}
/*
! * tuplesort_free
*
! * Internal routine for freeing resources of tuplesort.
*/
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1213,1219 ****
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! MemoryContextDelete(state->sortcontext);
}
/*
--- 1230,1327 ----
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! if (delete)
! {
! MemoryContextDelete(state->maincontext);
! }
! else
! {
! MemoryContextResetOnly(state->sortcontext);
! MemoryContextResetOnly(state->tuplecontext);
! }
! }
!
! /*
! * tuplesort_end
! *
! * Release resources and clean up.
! *
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
! */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! tuplesort_free(state, true);
! }
!
! /*
! * tuplesort_updatemax
! *
! * Update maximum resource usage statistics.
! */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! int64 memUsed;
! bool memUsedOnDisk;
!
! /*
! * Note: it might seem we should provide both memory and disk usage for a
! * disk-based sort. However, the current code doesn't track memory space
! * accurately once we have begun to return tuples to the caller (since we
! * don't account for pfree's the caller is expected to do), so we cannot
! * rely on availMem in a disk sort. This does not seem worth the overhead
! * to fix. Is it worth creating an API for the memory context code to
! * tell us how much is actually used in sortcontext?
! */
! if (state->tapeset)
! {
! memUsedOnDisk = true;
! memUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! }
! else
! {
! memUsedOnDisk = false;
! memUsed = state->allowedMem - state->availMem;
! }
!
! state->maxStatus = Max(state->maxStatus, state->status);
! if (memUsed > state->maxMem)
! {
! state->maxMem = memUsed;
! state->maxMemOnDisk = memUsedOnDisk;
! }
! }
!
! /*
! * tuplesort_reset
! *
! * Reset the tuplesort. Reset all the data in the tuplesort, but leave the
! * meta-information in. After tuplesort_reset, tuplesort is ready to start
! * a new sort. This avoids recreating the tuplesort (and thus saves
! * resources) when sorting multiple small batches.
! */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! tuplesort_updatemax(state);
! tuplesort_free(state, false);
! state->status = TSS_INITIAL;
! state->memtupcount = 0;
! state->boundUsed = false;
! state->tapeset = NULL;
! state->currentRun = 0;
! state->result_tape = -1;
! state->bounded = false;
! state->availMem = state->allowedMem;
! state->lastReturnedTuple = NULL;
! state->slabAllocatorUsed = false;
! state->slabMemoryBegin = NULL;
! state->slabMemoryEnd = NULL;
! state->slabFreeHead = NULL;
! USEMEM(state, GetMemoryChunkSpace(state->memtuples));
}
/*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 3219,3245 ****
const char **spaceType,
long *spaceUsed)
{
! /*
! * Note: it might seem we should provide both memory and disk usage for a
! * disk-based sort. However, the current code doesn't track memory space
! * accurately once we have begun to return tuples to the caller (since we
! * don't account for pfree's the caller is expected to do), so we cannot
! * rely on availMem in a disk sort. This does not seem worth the overhead
! * to fix. Is it worth creating an API for the memory context code to
! * tell us how much is actually used in sortcontext?
! */
! if (state->tapeset)
! {
*spaceType = "Disk";
- *spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- }
else
- {
*spaceType = "Memory";
! *spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! }
! switch (state->status)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
--- 3327,3341 ----
const char **spaceType,
long *spaceUsed)
{
! tuplesort_updatemax(state);
!
! if (state->maxMemOnDisk)
*spaceType = "Disk";
else
*spaceType = "Memory";
! *spaceUsed = (state->maxMem + 1023) / 1024;
! switch (state->maxStatus)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index 9f41bab..c95cb42
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1814,1819 ****
--- 1814,1833 ----
Tuplestorestate *tuplestorestate;
} MaterialState;
+
+ /* ----------------
+ * When sorting by multiple keys, the input dataset could already be
+ * presorted by some prefix of those keys. We call these "skip keys".
+ * SkipKeyData represents information about one such key.
+ * ----------------
+ */
+ typedef struct SkipKeyData
+ {
+ FmgrInfo flinfo; /* comparison function info */
+ FunctionCallInfoData fcinfo; /* comparison function call info */
+ OffsetNumber attno; /* attribute number in tuple */
+ } SkipKeyData;
+
/* ----------------
* SortState information
* ----------------
*************** typedef struct SortState
*** 1825,1833 ****
--- 1839,1852 ----
bool bounded; /* is the result set bounded? */
int64 bound; /* if bounded, how many tuples are needed */
bool sort_Done; /* sort completed yet? */
+ bool finished; /* fetching tuples from outer node
+ is finished ? */
bool bounded_Done; /* value of bounded we did the sort with */
int64 bound_Done; /* value of bound we did the sort with */
void *tuplesortstate; /* private state of tuplesort.c */
+ SkipKeyData *skipKeys; /* keys the dataset is presorted by */
+ int64 groupsCount; /* number of groups with equal skip keys */
+ TupleTableSlot *prevSlot; /* slot for previous tuple from outer node */
} SortState;
/* ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index 95dd8ba..7a5dcf5
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 71,76 ****
--- 71,77 ----
T_HashJoin,
T_Material,
T_Sort,
+ T_IncSort,
T_Group,
T_Agg,
T_WindowAgg,
*************** typedef enum NodeTag
*** 120,125 ****
--- 121,127 ----
T_HashJoinState,
T_MaterialState,
T_SortState,
+ T_IncSortState,
T_GroupState,
T_AggState,
T_WindowAggState,
*************** typedef enum NodeTag
*** 249,254 ****
--- 251,257 ----
T_ProjectionPath,
T_ProjectSetPath,
T_SortPath,
+ T_IncSortPath,
T_GroupPath,
T_UpperUniquePath,
T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index f72f7a8..6b96535
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 699,704 ****
--- 699,715 ----
bool *nullsFirst; /* NULLS FIRST/LAST directions */
} Sort;
+
+ /* ----------------
+ * incremental sort node
+ * ----------------
+ */
+ typedef struct IncSort
+ {
+ Sort sort;
+ int skipCols; /* number of presorted columns */
+ } IncSort;
+
/* ---------------
* group node -
* Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index f7ac6f6..2c56105
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1331,1336 ****
--- 1331,1346 ----
} SortPath;
/*
+ * IncSortPath
+ */
+ typedef struct IncSortPath
+ {
+ SortPath spath;
+ int skipCols;
+ } IncSortPath;
+
+
+ /*
* GroupPath represents grouping (of presorted input)
*
* groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 72200fa..c26ef9a
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
extern bool enable_bitmapscan;
extern bool enable_tidscan;
extern bool enable_sort;
+ extern bool enable_incsort;
extern bool enable_hashagg;
extern bool enable_nestloop;
extern bool enable_material;
*************** extern void cost_ctescan(Path *path, Pla
*** 95,102 ****
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
--- 96,104 ----
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index ebda308..3271203
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 180,185 ****
--- 180,186 ----
extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
Relids required_outer,
CostSelector cost_criterion);
*************** extern List *select_outer_pathkeys_for_m
*** 216,221 ****
--- 217,223 ----
extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
List *mergeclauses,
List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
extern List *truncate_useless_pathkeys(PlannerInfo *root,
RelOptInfo *rel,
List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 9f9d2dc..b8884b6
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 204,209 ****
--- 204,212 ----
extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows, List **pgset);
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ double tuples);
+
extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
double nbuckets);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 5b3f475..616f9f5
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 62,68 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
--- 62,69 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 104,109 ****
--- 105,112 ----
extern void tuplesort_end(Tuplesortstate *state);
+ extern void tuplesort_reset(Tuplesortstate *state);
+
extern void tuplesort_get_stats(Tuplesortstate *state,
const char **sortMethod,
const char **spaceType,
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort
*** 19,27 ****
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Sort
Sort Key: id, data
! -> Seq Scan on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
--- 19,28 ----
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Incremental Sort
Sort Key: id, data
! Presorted Key: id
! -> Index Scan using test_dc_pkey on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
new file mode 100644
index 0ff8062..3ad5eb3
*** a/src/test/regress/expected/aggregates.out
--- b/src/test/regress/expected/aggregates.out
*************** group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.y,t
*** 996,1010 ****
explain (costs off) select t1.*,t2.x,t2.z
from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
! QUERY PLAN
! -------------------------------------------------------
! HashAggregate
Group Key: t1.a, t1.b, t2.x, t2.z
! -> Merge Join
! Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
! -> Index Scan using t1_pkey on t1
! -> Index Scan using t2_pkey on t2
! (6 rows)
-- Cannot optimize when PK is deferrable
explain (costs off) select * from t3 group by a,b,c;
--- 996,1013 ----
explain (costs off) select t1.*,t2.x,t2.z
from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
! QUERY PLAN
! -------------------------------------------------------------
! Group
Group Key: t1.a, t1.b, t2.x, t2.z
! -> Incremental Sort
! Sort Key: t1.a, t1.b, t2.z
! Presorted Key: t1.a, t1.b
! -> Merge Join
! Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
! -> Index Scan using t1_pkey on t1
! -> Index Scan using t2_pkey on t2
! (9 rows)
-- Cannot optimize when PK is deferrable
explain (costs off) select * from t3 group by a,b,c;
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index a8c8b28..11d697e
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE: drop cascades to table matest1
*** 1448,1453 ****
--- 1448,1454 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1588,1596 ****
--- 1589,1633 ----
{3,7,8,10,13,13,16,18,19,22}
(3 rows)
+ set enable_incsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ QUERY PLAN
+ -------------------------------------------------------------------------
+ Merge Append
+ Sort Key: tenk1.thousand, tenk1.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1
+ -> Incremental Sort
+ Sort Key: tenk1_1.thousand, tenk1_1.thousand
+ Presorted Key: tenk1_1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ QUERY PLAN
+ -------------------------------------------------------------
+ Merge Append
+ Sort Key: a.thousand, a.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 a
+ -> Incremental Sort
+ Sort Key: b.unique2, b.unique2
+ Presorted Key: b.unique2
+ -> Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incsort;
--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index d48abd7..f6a99d1
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select name, setting from pg_settings wh
*** 75,80 ****
--- 75,81 ----
enable_bitmapscan | on
enable_hashagg | on
enable_hashjoin | on
+ enable_incsort | on
enable_indexonlyscan | on
enable_indexscan | on
enable_material | on
*************** select name, setting from pg_settings wh
*** 83,89 ****
enable_seqscan | on
enable_sort | on
enable_tidscan | on
! (11 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
--- 84,90 ----
enable_seqscan | on
enable_sort | on
enable_tidscan | on
! (12 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index a8b7eb1..5cf4426
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 498,503 ****
--- 498,504 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
*************** SELECT
*** 559,567 ****
--- 560,585 ----
ORDER BY f.i LIMIT 10)
FROM generate_series(1, 3) g(i);
+ set enable_incsort = on;
+
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incsort;
--
-- Check that constraint exclusion works correctly with partitions using
On Sat, Feb 18, 2017 at 4:01 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
I decided to start new thread for this patch for following two reasons.
* It's renamed from "Partial sort" to "Incremental sort" per suggestion by
Robert Haas [1]. New name much better characterizes the essence of
algorithm.
* I think it's not PoC anymore. Patch received several rounds of review
and now it's in the pretty good shape.
Attached revision of patch has following changes.
* According to review [1], two new path and plan nodes are responsible for
incremental sort: IncSortPath and IncSort which are inherited from SortPath
and Sort correspondingly. That allowed to get rid of set of hacks with
minimal code changes.
* According to review [1] and comment [2], previous tuple is stored in
standalone tuple slot of SortState rather than just HeapTuple.
* New GUC parameter enable_incsort is introduced to control planner ability
to choose incremental sort.
* Test of postgres_fdw with not pushed down cross join is corrected. It
appeared that with incremental sort such query is profitable to push down.
I changed ORDER BY columns so that index couldn't be used. I think this
solution is more elegant than setting enable_incsort = off.
I usually advocate for spelling things out instead of abbreviating, so
I guess I'll stay true to form here and suggest that abbreviating
incremental to inc doesn't seem like a great idea. Is that sort
incrementing, incremental, incredible, incautious, or incorporated?
The first hunk in the patch, a change in the postgres_fdw regression
test output, looks an awful lot like a bug: now the query that
formerly returned various different numbers is returning all zeroes.
It might not actually be a bug, because you've also changed the test
query (not sure why), but anyway the new regression test output that
is all zeroes seems less useful for catching bugs in, say, the
ordering of the results than the old output where the different rows
were different.
I don't know of any existing cases where the same executor file is
responsible for executing more than one type of executor node.
I was imagining a more-complete separation of the new executor node.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Feb 19, 2017 at 2:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Feb 18, 2017 at 4:01 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
I decided to start new thread for this patch for following two reasons.
* It's renamed from "Partial sort" to "Incremental sort" per suggestion by
Robert Haas [1]. New name much better characterizes the essence of
algorithm.
* I think it's not PoC anymore. Patch received several rounds of review
and now it's in the pretty good shape.
Attached revision of patch has following changes.
* According to review [1], two new path and plan nodes are responsible for
incremental sort: IncSortPath and IncSort which are inherited from
SortPath
and Sort correspondingly. That allowed to get rid of set of hacks with
minimal code changes.
* According to review [1] and comment [2], previous tuple is stored in
standalone tuple slot of SortState rather than just HeapTuple.
* New GUC parameter enable_incsort is introduced to control planner ability
to choose incremental sort.
* Test of postgres_fdw with not pushed down cross join is corrected. It
appeared that with incremental sort such query is profitable to push down.
I changed ORDER BY columns so that index couldn't be used. I think this
solution is more elegant than setting enable_incsort = off.
I usually advocate for spelling things out instead of abbreviating, so
I guess I'll stay true to form here and suggest that abbreviating
incremental to inc doesn't seem like a great idea. Is that sort
incrementing, incremental, incredible, incautious, or incorporated?
I'm not that sure about the naming of GUCs, because we already
have enable_hashagg instead of enable_hashaggregate, enable_material
instead of enable_materialize, and enable_nestloop instead
of enable_nestedloop. But anyway, I renamed "inc" to "Incremental"
everywhere in the code. I renamed the enable_incsort GUC to
enable_incrementalsort as well, since I don't have a strong opinion here.
The first hunk in the patch, a change in the postgres_fdw regression
test output, looks an awful lot like a bug: now the query that
formerly returned various different numbers is returning all zeroes.
It might not actually be a bug, because you've also changed the test
query (not sure why), but anyway the new regression test output that
is all zeroes seems less useful for catching bugs in, say, the
ordering of the results than the old output where the different rows
were different.
Yes, I changed the regression test query, as I mentioned in the previous
message. With the incremental sort feature, the original query can no longer
serve as an example of a non-pushed-down join. However, you're right that a
query which returns all zeroes doesn't look good there either. So I changed
that query to order by column "c3", which is actually a non-indexed textual
representation of "c1".
I don't know of any existing cases where the same executor file is
responsible for executing more than 1 different type of executor node.
I was imagining a more-complete separation of the new executor node.
OK, I put incremental sort into a separate executor node.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
incremental-sort-2.patch (application/octet-stream)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index 0b9e3e4..2f8aa6f
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1803,1841 ****
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ---------------------------------------------------------------------
Limit
! Output: t1.c1, t2.c1
-> Sort
! Output: t1.c1, t2.c1
! Sort Key: t1.c1, t2.c1
-> Nested Loop
! Output: t1.c1, t2.c1
-> Foreign Scan on public.ft1 t1
! Output: t1.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-> Materialize
! Output: t2.c1
-> Foreign Scan on public.ft2 t2
! Output: t2.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! c1 | c1
! ----+-----
! 1 | 101
! 1 | 102
! 1 | 103
! 1 | 104
! 1 | 105
! 1 | 106
! 1 | 107
! 1 | 108
! 1 | 109
! 1 | 110
(10 rows)
-- different server, not pushed down. No result expected.
--- 1803,1841 ----
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ------------------------------------------------------------------
Limit
! Output: t1.c3, t2.c3
-> Sort
! Output: t1.c3, t2.c3
! Sort Key: t1.c3, t2.c3
-> Nested Loop
! Output: t1.c3, t2.c3
-> Foreign Scan on public.ft1 t1
! Output: t1.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
-> Materialize
! Output: t2.c3
-> Foreign Scan on public.ft2 t2
! Output: t2.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! c3 | c3
! -------+-------
! 00001 | 00101
! 00001 | 00102
! 00001 | 00103
! 00001 | 00104
! 00001 | 00105
! 00001 | 00106
! 00001 | 00107
! 00001 | 00108
! 00001 | 00109
! 00001 | 00110
(10 rows)
-- different server, not pushed down. No result expected.
*************** select c2/2, sum(c2) * (c2/2) from ft1 g
*** 2377,2394 ****
-- Aggregates in subquery are pushed down.
explain (verbose, costs off)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
! QUERY PLAN
! ----------------------------------------------------------------------------------------------------------
Aggregate
Output: count(ft1.c2), sum(ft1.c2)
! -> Sort
Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
Sort Key: ft1.c2, (sum(ft1.c1))
! -> Foreign Scan
! Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
! Relations: Aggregate on (public.ft1)
! Remote SQL: SELECT c2, sum("C 1"), sqrt("C 1") FROM "S 1"."T 1" GROUP BY c2, (sqrt("C 1"))
! (9 rows)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
count | sum
--- 2377,2397 ----
-- Aggregates in subquery are pushed down.
explain (verbose, costs off)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
! QUERY PLAN
! --------------------------------------------------------------------------------------------------------------------------
Aggregate
Output: count(ft1.c2), sum(ft1.c2)
! -> Incremental Sort
Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
Sort Key: ft1.c2, (sum(ft1.c1))
! Presorted Key: ft1.c2
! -> GroupAggregate
! Output: ft1.c2, sum(ft1.c1), (sqrt((ft1.c1)::double precision))
! Group Key: ft1.c2, sqrt((ft1.c1)::double precision)
! -> Foreign Scan on public.ft1
! Output: ft1.c2, sqrt((ft1.c1)::double precision), ft1.c1
! Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" ORDER BY c2 ASC NULLS LAST, sqrt("C 1") ASC NULLS LAST
! (12 rows)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
count | sum
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index 56b01d0..8a61277
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 462,469 ****
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 462,469 ----
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index 1b390a2..cda89a3
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3524,3529 ****
--- 3524,3543 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+ <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of incremental sort
+ steps. The default is <literal>on</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
<term><varname>enable_indexscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index c9e0a3e..e1fe3b7
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual,
*** 79,84 ****
--- 79,86 ----
ExplainState *es);
static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 92,98 ****
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 94,100 ----
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 100,105 ****
--- 102,109 ----
static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 974,979 ****
--- 978,986 ----
case T_Sort:
pname = sname = "Sort";
break;
+ case T_IncrementalSort:
+ pname = sname = "Incremental Sort";
+ break;
case T_Group:
pname = sname = "Group";
break;
*************** ExplainNode(PlanState *planstate, List *
*** 1507,1512 ****
--- 1514,1525 ----
show_sort_keys(castNode(SortState, planstate), ancestors, es);
show_sort_info(castNode(SortState, planstate), es);
break;
+ case T_IncrementalSort:
+ show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ ancestors, es);
+ show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ es);
+ break;
case T_MergeAppend:
show_merge_append_keys(castNode(MergeAppendState, planstate),
ancestors, es);
*************** static void
*** 1832,1846 ****
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
* Likewise, for a MergeAppend node.
*/
static void
--- 1845,1882 ----
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
+ int skipCols;
+
+ if (IsA(plan, IncrementalSort))
+ skipCols = ((IncrementalSort *) plan)->skipCols;
+ else
+ skipCols = 0;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, skipCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es)
+ {
+ IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+ show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ plan->sort.numCols, plan->skipCols,
+ plan->sort.sortColIdx,
+ plan->sort.sortOperators, plan->sort.collations,
+ plan->sort.nullsFirst,
+ ancestors, es);
+ }
+
+ /*
* Likewise, for a MergeAppend node.
*/
static void
*************** show_merge_append_keys(MergeAppendState
*** 1850,1856 ****
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
--- 1886,1892 ----
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, 0, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1874,1880 ****
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
--- 1910,1916 ----
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
*************** show_grouping_set_keys(PlanState *planst
*** 1930,1936 ****
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
--- 1966,1972 ----
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, 0, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 1987,1993 ****
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
--- 2023,2029 ----
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2000,2012 ****
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
--- 2036,2049 ----
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
+ List *resultPresorted = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2046,2054 ****
--- 2083,2095 ----
nullsFirst[keyno]);
/* Emit one property-list item per sort key */
result = lappend(result, pstrdup(sortkeybuf.data));
+ if (keyno < nPresortedKeys)
+ resultPresorted = lappend(resultPresorted, exprstr);
}
ExplainPropertyList(qlabel, result, es);
+ if (nPresortedKeys > 0)
+ ExplainPropertyList("Presorted Key", resultPresorted, es);
}
/*
*************** show_sort_info(SortState *sortstate, Exp
*** 2206,2211 ****
--- 2247,2289 ----
}
/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es)
+ {
+ if (es->analyze && incrsortstate->sort_Done &&
+ incrsortstate->tuplesortstate != NULL)
+ {
+ Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+
+ tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",
+ sortMethod, spaceType, spaceUsed);
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort groups: %ld\n",
+ incrsortstate->groupsCount);
+ }
+ else
+ {
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups",
+ incrsortstate->groupsCount, es);
+ }
+ }
+ }
+
+ /*
* Show information on hash buckets/batches.
*/
static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index 2a2b7eb..d80883d
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execGroup
*** 23,30 ****
nodeLimit.o nodeLockRows.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
! nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o
--- 23,31 ----
nodeLimit.o nodeLockRows.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
! nodeSort.o nodeIncrementalSort.o \
! nodeUnique.o nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index d380207..16df1b2
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 30,35 ****
--- 30,36 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 238,243 ****
--- 239,248 ----
ExecReScanSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecReScanIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecReScanGroup((GroupState *) node);
break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 509,516 ****
--- 514,525 ----
case T_CteScan:
case T_Material:
case T_Sort:
+ /* these don't evaluate tlist */
return true;
+ case T_IncrementalSort:
+ return false;
+
case T_LockRows:
case T_Limit:
return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index ef6f35a..5c77ab1
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 92,97 ****
--- 92,98 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 295,300 ****
--- 296,306 ----
estate, eflags);
break;
+ case T_IncrementalSort:
+ result = (PlanState *) ExecInitIncrementalSort(
+ (IncrementalSort *) node, estate, eflags);
+ break;
+
case T_Group:
result = (PlanState *) ExecInitGroup((Group *) node,
estate, eflags);
*************** ExecProcNode(PlanState *node)
*** 505,510 ****
--- 511,520 ----
result = ExecSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ result = ExecIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
result = ExecGroup((GroupState *) node);
break;
*************** ExecEndNode(PlanState *node)
*** 761,766 ****
--- 771,780 ----
ExecEndSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecEndIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecEndGroup((GroupState *) node);
break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index aa08152..aa4d8e2
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 559,564 ****
--- 559,565 ----
sortnode->collations,
sortnode->nullsFirst,
work_mem,
+ false,
false);
}
*************** initialize_aggregate(AggState *aggstate,
*** 637,643 ****
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false);
}
/*
--- 638,644 ----
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false, false);
}
/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...04576c6
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,485 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ * Routines to handle incremental sorting of relations.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+
+ /*
+ * Check if first "skipCols" sort values are equal.
+ */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ TupleTableSlot *b)
+ {
+ int n, i;
+
+ Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+ n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+ for (i = 0; i < n; i++)
+ {
+ Datum datumA, datumB, result;
+ bool isnullA, isnullB;
+ AttrNumber attno = node->skipKeys[i].attno;
+ SkipKeyData *key;
+
+ datumA = slot_getattr(a, attno, &isnullA);
+ datumB = slot_getattr(b, attno, &isnullB);
+
+ /* Special case for NULL-vs-NULL, else use standard comparison */
+ if (isnullA || isnullB)
+ {
+ if (isnullA == isnullB)
+ continue;
+ else
+ return false;
+ }
+
+ key = &node->skipKeys[i];
+
+ key->fcinfo.arg[0] = datumA;
+ key->fcinfo.arg[1] = datumB;
+
+ /* just for paranoia's sake, we reset isnull each time */
+ key->fcinfo.isnull = false;
+
+ result = FunctionCallInvoke(&key->fcinfo);
+
+ /* Check for null result, since caller is clearly not expecting one */
+ if (key->fcinfo.isnull)
+ elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+ if (!DatumGetBool(result))
+ return false;
+ }
+ return true;
+ }
+
+ /*
+ * Prepare information for skipKeys comparison.
+ */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ int skipCols,
+ i;
+
+ Assert(IsA(plannode, IncrementalSort));
+ skipCols = plannode->skipCols;
+
+ node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+ for (i = 0; i < skipCols; i++)
+ {
+ Oid equalityOp, equalityFunc;
+ SkipKeyData *key;
+
+ key = &node->skipKeys[i];
+ key->attno = plannode->sort.sortColIdx[i];
+
+ equalityOp = get_equality_op_for_ordering_op(
+ plannode->sort.sortOperators[i], NULL);
+ if (!OidIsValid(equalityOp))
+ elog(ERROR, "missing equality operator for ordering operator %u",
+ plannode->sort.sortOperators[i]);
+
+ equalityFunc = get_opcode(equalityOp);
+ if (!OidIsValid(equalityFunc))
+ elog(ERROR, "missing function for operator %u", equalityOp);
+
+ /* Lookup the comparison function */
+ fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+ /* We can initialize the callinfo just once and re-use it */
+ InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ plannode->sort.collations[i], NULL, NULL);
+ key->fcinfo.argnull[0] = false;
+ key->fcinfo.argnull[1] = false;
+ }
+ }
+
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSort
+ *
+ * Assuming that the outer subtree returns tuples presorted by some
+ * prefix of the target sort columns, performs an incremental sort.
+ * It fetches groups of tuples whose prefix sort columns are equal
+ * and sorts them using tuplesort. This approach avoids sorting the
+ * whole dataset at once. Besides taking less memory and being
+ * faster, it allows returning tuples before the full dataset has
+ * been fetched from the outer subtree.
+ *
+ * Conditions:
+ * -- none.
+ *
+ * Initial States:
+ * -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+ TupleTableSlot *
+ ExecIncrementalSort(IncrementalSortState *node)
+ {
+ EState *estate;
+ ScanDirection dir;
+ Tuplesortstate *tuplesortstate;
+ TupleTableSlot *slot;
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ PlanState *outerNode;
+ int skipCols;
+ TupleDesc tupDesc;
+ int64 nTuples = 0;
+
+ skipCols = plannode->skipCols;
+
+ /*
+ * get state info from node
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "entering routine");
+
+ estate = node->ss.ps.state;
+ dir = estate->es_direction;
+ tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+ /*
+ * Return next tuple from sorted set if any.
+ */
+ if (node->sort_Done)
+ {
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ if (tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ slot, NULL) || node->finished)
+ return slot;
+ }
+
+ /*
+ * If first time through, read all tuples from outer plan and pass them to
+ * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
+ */
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "sorting subplan");
+
+ /*
+ * Want to scan subplan in the forward direction while creating the
+ * sorted data.
+ */
+ estate->es_direction = ForwardScanDirection;
+
+ /*
+ * Initialize tuplesort module.
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "calling tuplesort_begin");
+
+ outerNode = outerPlanState(node);
+ tupDesc = ExecGetResultType(outerNode);
+
+ if (node->tuplesortstate == NULL)
+ {
+ /*
+ * We are going to process the first group of presorted data.
+ * Initialize support structures for cmpSortSkipCols - already
+ * sorted columns.
+ */
+ prepareSkipCols(node);
+
+ /*
+ * Only pass on the remaining columns that are unsorted. Skip
+ * abbreviated keys for incremental sort: we are unlikely to have
+ * huge groups with incremental sort, so using abbreviated keys
+ * would likely be a waste of time.
+ */
+ tuplesortstate = tuplesort_begin_heap(
+ tupDesc,
+ plannode->sort.numCols - skipCols,
+ &(plannode->sort.sortColIdx[skipCols]),
+ &(plannode->sort.sortOperators[skipCols]),
+ &(plannode->sort.collations[skipCols]),
+ &(plannode->sort.nullsFirst[skipCols]),
+ work_mem,
+ false,
+ true);
+ node->tuplesortstate = (void *) tuplesortstate;
+ node->groupsCount++;
+ }
+ else
+ {
+ /* Next group of presorted data */
+ tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ node->groupsCount++;
+ }
+
+ /* Calculate remaining bound for bounded sort */
+ if (node->bounded)
+ tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+ /*
+ * Put next group of tuples where skipCols sort values are equal to
+ * tuplesort.
+ */
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ /* Put next group of presorted data to the tuplesort */
+ if (node->prevSlot->tts_isempty)
+ {
+ /* First tuple */
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+ else
+ {
+ ExecCopySlot(node->prevSlot, slot);
+ }
+ }
+ else
+ {
+ /* Put previous tuple into tuplesort */
+ tuplesort_puttupleslot(tuplesortstate, node->prevSlot);
+ nTuples++;
+
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+ else
+ {
+ bool cmp;
+ cmp = cmpSortSkipCols(node, node->prevSlot, slot);
+
+ /* Replace previous tuple with current one */
+ ExecCopySlot(node->prevSlot, slot);
+
+ /*
+ * When skipCols are not equal then group of presorted data
+ * is finished
+ */
+ if (!cmp)
+ break;
+ }
+ }
+ }
+
+ /*
+ * Complete the sort.
+ */
+ tuplesort_performsort(tuplesortstate);
+
+ /*
+ * restore to user specified direction
+ */
+ estate->es_direction = dir;
+
+ /*
+ * finally set the sorted flag to true
+ */
+ node->sort_Done = true;
+ node->bounded_Done = node->bounded;
+
+ /*
+ * Adjust bound_Done with number of tuples we've actually sorted.
+ */
+ if (node->bounded)
+ {
+ if (node->finished)
+ node->bound_Done = node->bound;
+ else
+ node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ }
+
+ SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "retrieving tuple from tuplesort");
+
+ /*
+ * Get the first or next tuple from tuplesort. Returns NULL if no more
+ * tuples.
+ */
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ slot, NULL);
+ return slot;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecInitIncrementalSort
+ *
+ * Creates the run-time state information for the sort node
+ * produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ IncrementalSortState *incrsortstate;
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "initializing sort node");
+
+ /*
+ * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+ * current group of tuples in tuplesortstate.
+ */
+ Assert((eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) == 0);
+
+ /*
+ * create state structure
+ */
+ incrsortstate = makeNode(IncrementalSortState);
+ incrsortstate->ss.ps.plan = (Plan *) node;
+ incrsortstate->ss.ps.state = estate;
+
+ incrsortstate->bounded = false;
+ incrsortstate->sort_Done = false;
+ incrsortstate->finished = false;
+ incrsortstate->tuplesortstate = NULL;
+ incrsortstate->prevSlot = NULL;
+ incrsortstate->bound_Done = 0;
+ incrsortstate->groupsCount = 0;
+ incrsortstate->skipKeys = NULL;
+
+ /*
+ * Miscellaneous initialization
+ *
+ * Sort nodes don't initialize their ExprContexts because they never call
+ * ExecQual or ExecProject.
+ */
+
+ /*
+ * tuple table initialization
+ *
+ * sort nodes only return scan tuples from their sorted relation.
+ */
+ ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+
+ /*
+ * initialize child nodes
+ *
+ * We shield the child node from the need to support REWIND, BACKWARD, or
+ * MARK/RESTORE.
+ */
+ eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+ outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+ /*
+ * initialize tuple type. no need to initialize projection info because
+ * this node doesn't do projections.
+ */
+ ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+ /* make standalone slot to store previous tuple from outer node */
+ incrsortstate->prevSlot = MakeSingleTupleTableSlot(
+ ExecGetResultType(outerPlanState(incrsortstate)));
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "sort node initialized");
+
+ return incrsortstate;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "shutting down sort node");
+
+ /*
+ * clean out the tuple table
+ */
+ ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ /* must drop standalone tuple slot from outer node */
+ ExecDropSingleTupleTableSlot(node->prevSlot);
+
+ /*
+ * Release tuplesort resources
+ */
+ if (node->tuplesortstate != NULL)
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+
+ /*
+ * shut down the subplan
+ */
+ ExecEndNode(outerPlanState(node));
+
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "sort node shutdown");
+ }
+
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ PlanState *outerPlan = outerPlanState(node);
+
+ /*
+ * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ * re-scan it at all.
+ */
+ if (!node->sort_Done)
+ return;
+
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+ /*
+ * Incremental sort holds only the current bucket in tuplesortstate, so we
+ * cannot simply rewind and rescan the sorted output. We always forget
+ * previous sort results, then re-read and re-sort the subplan's output.
+ */
+ node->sort_Done = false;
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+ node->bound_Done = 0;
+
+ /*
+ * if chgParam of subnode is not null then plan will be re-scanned by
+ * first ExecProcNode.
+ */
+ if (outerPlan->chgParam == NULL)
+ ExecReScan(outerPlan);
+ }
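The hunk above only shows the init/shutdown/rescan half of the new executor node; the per-group driving loop lives in ExecIncrementalSort, which is not quoted here. As a hedged illustration of the underlying idea — sort one "bucket" of rows whose presorted prefix keys are equal, emit it, then move to the next bucket — here is a minimal standalone C sketch (names and types are mine, not the patch's) over an array already ordered by its first field:

```c
#include <stdlib.h>

typedef struct { int a; int b; } Row;   /* input presorted by a; want (a, b) */

static int cmp_b(const void *x, const void *y)
{
    return ((const Row *) x)->b - ((const Row *) y)->b;
}

/*
 * Incremental sort over rows already ordered by "a": find each run of equal
 * "a" values and qsort just that run by "b".  Only one run is ever held
 * unsorted at a time, which is why the real node cannot support
 * REWIND/BACKWARD/MARK.
 */
void incremental_sort(Row *rows, int n)
{
    int start = 0;
    int i;

    for (i = 1; i <= n; i++)
    {
        if (i == n || rows[i].a != rows[start].a)
        {
            qsort(rows + start, i - start, sizeof(Row), cmp_b);
            start = i;
        }
    }
}
```

Because each bucket is typically much smaller than the whole input, the first rows can be returned after sorting only the first bucket, which is what makes the node attractive under LIMIT.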
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 591a31a..cf228d6
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(SortState *node)
*** 89,95 ****
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
--- 89,96 ----
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess,
! false);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index 05d8538..1288789
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 837,842 ****
--- 837,860 ----
/*
+ * CopySortFields
+ *
+ * This function copies the fields of the Sort node. It is used by
+ * all the copy functions for classes which inherit from Sort.
+ */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+ COPY_SCALAR_FIELD(numCols);
+ COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+
+ /*
* _copySort
*/
static Sort *
*************** _copySort(const Sort *from)
*** 847,859 ****
/*
* copy node superclass fields
*/
! CopyPlanFields((const Plan *) from, (Plan *) newnode);
! COPY_SCALAR_FIELD(numCols);
! COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
return newnode;
}
--- 865,893 ----
/*
* copy node superclass fields
*/
! CopySortFields(from, newnode);
! return newnode;
! }
!
!
! /*
! * _copyIncrementalSort
! */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! IncrementalSort *newnode = makeNode(IncrementalSort);
!
! /*
! * copy node superclass fields
! */
! CopySortFields((const Sort *) from, (Sort *) newnode);
!
! /*
! * copy remainder of node
! */
! COPY_SCALAR_FIELD(skipCols);
return newnode;
}
*************** copyObject(const void *from)
*** 4583,4588 ****
--- 4617,4625 ----
case T_Sort:
retval = _copySort(from);
break;
+ case T_IncrementalSort:
+ retval = _copyIncrementalSort(from);
+ break;
case T_Group:
retval = _copyGroup(from);
break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index b3802b4..10cec96
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 781,792 ****
}
static void
! _outSort(StringInfo str, const Sort *node)
{
int i;
- WRITE_NODE_TYPE("SORT");
-
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
--- 781,790 ----
}
static void
! _outSortInfo(StringInfo str, const Sort *node)
{
int i;
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 809,814 ****
--- 807,830 ----
}
static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ WRITE_NODE_TYPE("SORT");
+
+ _outSortInfo(str, node);
+ }
+
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ WRITE_NODE_TYPE("INCREMENTALSORT");
+
+ _outSortInfo(str, (const Sort *) node);
+
+ WRITE_INT_FIELD(skipCols);
+ }
+
+ static void
_outUnique(StringInfo str, const Unique *node)
{
int i;
*************** outNode(StringInfo str, const void *obj)
*** 3482,3487 ****
--- 3498,3506 ----
case T_Sort:
_outSort(str, obj);
break;
+ case T_IncrementalSort:
+ _outIncrementalSort(str, obj);
+ break;
case T_Unique:
_outUnique(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index d2f69fe..c1b084e
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 1978,1989 ****
}
/*
! * _readSort
*/
! static Sort *
! _readSort(void)
{
! READ_LOCALS(Sort);
ReadCommonPlan(&local_node->plan);
--- 1978,1990 ----
}
/*
! * ReadCommonSort
! * Assign the basic stuff of all nodes that inherit from Sort
*/
! static void
! ReadCommonSort(Sort *local_node)
{
! READ_TEMP_LOCALS();
ReadCommonPlan(&local_node->plan);
*************** _readSort(void)
*** 1992,1997 ****
--- 1993,2024 ----
READ_OID_ARRAY(sortOperators, local_node->numCols);
READ_OID_ARRAY(collations, local_node->numCols);
READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+
+ /*
+ * _readSort
+ */
+ static Sort *
+ _readSort(void)
+ {
+ READ_LOCALS_NO_FIELDS(Sort);
+
+ ReadCommonSort(local_node);
+
+ READ_DONE();
+ }
+
+ /*
+ * _readIncrementalSort
+ */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ READ_LOCALS(IncrementalSort);
+
+ ReadCommonSort(&local_node->sort);
+
+ READ_INT_FIELD(skipCols);
READ_DONE();
}
*************** parseNodeString(void)
*** 2520,2525 ****
--- 2547,2554 ----
return_value = _readMaterial();
else if (MATCH("SORT", 4))
return_value = _readSort();
+ else if (MATCH("INCREMENTALSORT", 15))
+ return_value = _readIncrementalSort();
else if (MATCH("GROUP", 5))
return_value = _readGroup();
else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index 633b5c1..bdbd8bf
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3097,3102 ****
--- 3097,3106 ----
ptype = "Sort";
subpath = ((SortPath *) path)->subpath;
break;
+ case T_IncrementalSortPath:
+ ptype = "IncrementalSort";
+ subpath = ((SortPath *) path)->subpath;
+ break;
case T_GroupPath:
ptype = "Group";
subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index c138f57..a131c10
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
+ bool enable_incrementalsort = true;
bool enable_hashagg = true;
bool enable_nestloop = true;
bool enable_material = true;
*************** cost_recursive_union(Path *runion, Path
*** 1418,1423 ****
--- 1419,1431 ----
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
+ * The sort can be either a full sort of the relation or an incremental sort
+ * when the input is already presorted by a prefix of the required pathkeys.
+ * In the latter case we estimate the number of groups the input is divided
+ * into by the presorted pathkeys, then estimate the cost of sorting each
+ * individual group, assuming the data is divided among groups uniformly.
+ * Also, if a LIMIT is specified, we only have to pull from the input and
+ * sort some of the groups rather than all of them.
+ *
* If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path
*** 1444,1450 ****
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
--- 1452,1459 ----
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_startup_cost' is the startup cost for reading the input data
! * 'input_total_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path
*** 1460,1478 ****
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_cost;
! Cost run_cost = 0;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
path->rows = tuples;
--- 1469,1496 ----
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_startup_cost;
! Cost run_cost = 0,
! rest_cost,
! group_cost,
! input_run_cost = input_total_cost - input_startup_cost;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
+ double num_groups,
+ group_input_bytes,
+ group_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
+ if (!enable_incrementalsort)
+ presorted_keys = 0;
path->rows = tuples;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1498,1510 ****
output_bytes = input_bytes;
}
! if (output_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(input_bytes / BLCKSZ);
! double nruns = input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
--- 1516,1565 ----
output_bytes = input_bytes;
}
! /*
! * Estimate the number of groups into which the presorted keys divide the dataset.
! */
! if (presorted_keys > 0)
! {
! List *presortedExprs = NIL;
! ListCell *l;
! int i = 0;
!
! /* Extract presorted keys as list of expressions */
! foreach(l, pathkeys)
! {
! PathKey *key = (PathKey *)lfirst(l);
! EquivalenceMember *member = (EquivalenceMember *)
! lfirst(list_head(key->pk_eclass->ec_members));
!
! presortedExprs = lappend(presortedExprs, member->em_expr);
!
! i++;
! if (i >= presorted_keys)
! break;
! }
!
! /* Estimate number of groups with equal presorted keys */
! num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! }
! else
! {
! num_groups = 1.0;
! }
!
! /*
! * Estimate the average cost of sorting a single group in which all the
! * presorted keys are equal.
! */
! group_input_bytes = input_bytes / num_groups;
! group_tuples = tuples / num_groups;
! if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(group_input_bytes / BLCKSZ);
! double nruns = group_input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1514,1520 ****
*
* Assume about N log2 N comparisons
*/
! startup_cost += comparison_cost * tuples * LOG2(tuples);
/* Disk costs */
--- 1569,1575 ----
*
* Assume about N log2 N comparisons
*/
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
/* Disk costs */
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1525,1534 ****
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! startup_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1580,1589 ----
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! group_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1536,1549 ****
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
}
else
{
/* We'll use plain quicksort on all the input tuples */
! startup_cost += comparison_cost * tuples * LOG2(tuples);
}
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
--- 1591,1616 ----
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
}
else
{
/* We'll use plain quicksort on all the input tuples */
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
}
+ /* Add per group cost of fetching tuples from input */
+ group_cost += input_run_cost / num_groups;
+
+ /*
+ * We have to sort the first group before the node can start returning
+ * tuples. Sorting the remaining groups is needed to return all the other
+ * tuples, so it is charged to run cost.
+ */
+ startup_cost += group_cost;
+ rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ if (rest_cost > 0.0)
+ run_cost += rest_cost;
+
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2297,2302 ****
--- 2364,2371 ----
cost_sort(&sort_path,
root,
outersortkeys,
+ pathkeys_common(outer_path->pathkeys, outersortkeys),
+ outer_path->startup_cost,
outer_path->total_cost,
outer_path_rows,
outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2323,2328 ****
--- 2392,2399 ----
cost_sort(&sort_path,
root,
innersortkeys,
+ pathkeys_common(inner_path->pathkeys, innersortkeys),
+ inner_path->startup_cost,
inner_path->total_cost,
inner_path_rows,
inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 1065b31..9b06c6a
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
#include "nodes/nodeFuncs.h"
#include "nodes/plannodes.h"
#include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/tlist.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
return PATHKEYS_EQUAL;
}
+
+ /*
+ * pathkeys_common
+ * Returns length of longest common prefix of keys1 and keys2.
+ */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ int n;
+ ListCell *key1,
+ *key2;
+ n = 0;
+
+ forboth(key1, keys1, key2, keys2)
+ {
+ PathKey *pathkey1 = (PathKey *) lfirst(key1);
+ PathKey *pathkey2 = (PathKey *) lfirst(key2);
+
+ if (pathkey1 != pathkey2)
+ return n;
+ n++;
+ }
+
+ return n;
+ }
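Since PathKey nodes are canonical, pathkeys_common() can compare them by pointer. A hedged standalone analogue of the same longest-common-prefix walk, with ints standing in for the PathKey pointers (names are mine):

```c
/*
 * Toy analogue of pathkeys_common(): count how many leading elements of
 * keys1 and keys2 are identical.  In the planner the elements are
 * canonical PathKey pointers, so equality is pointer equality.
 */
int common_prefix_len(const int *keys1, int n1, const int *keys2, int n2)
{
    int n = 0;

    while (n < n1 && n < n2 && keys1[n] == keys2[n])
        n++;
    return n;
}
```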
+
+
/*
* pathkeys_contained_in
* Common special case of compare_pathkeys: we just want to know
*************** get_cheapest_path_for_pathkeys(List *pat
*** 368,375 ****
/*
* get_cheapest_fractional_path_for_pathkeys
* Find the cheapest path (for retrieving a specified fraction of all
! * the tuples) that satisfies the given pathkeys and parameterization.
! * Return NULL if no such path.
*
* See compare_fractional_path_costs() for the interpretation of the fraction
* parameter.
--- 397,408 ----
/*
* get_cheapest_fractional_path_for_pathkeys
* Find the cheapest path (for retrieving a specified fraction of all
! * the tuples) that satisfies the given parameterization and at least
! * partially satisfies the given pathkeys. Return NULL if no such path.
! * If the pathkeys are satisfied only partially, an incremental sort is
! * needed to satisfy them completely. Since incremental sort consumes its
! * input in whole presorted groups, it has to read more input data than a
! * fully presorted path would.
*
* See compare_fractional_path_costs() for the interpretation of the fraction
* parameter.
*************** right_merge_direction(PlannerInfo *root,
*** 1461,1486 ****
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
! * no good to order by just the first key(s) of the requested ordering.
! * So the result is always either 0 or list_length(root->query_pathkeys).
*/
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
{
! if (root->query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
{
! /* It's useful ... or at least the first N keys are */
! return list_length(root->query_pathkeys);
}
-
- return 0; /* path ordering not useful */
}
/*
--- 1494,1535 ----
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Returns the number of leading pathkeys that match the given
! * query_pathkeys. The remaining pathkeys can be satisfied by incremental
! * sort.
*/
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
{
! int n_common_pathkeys;
!
! if (query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
!
! if (enable_incrementalsort)
{
! /*
! * Any common prefix of the pathkeys is useful for ordering, because
! * incremental sort can take care of the remaining pathkeys.
! */
! return n_common_pathkeys;
! }
! else
! {
! /*
! * When incremental sort is disabled, the pathkeys are useful only when
! * they contain all the query pathkeys.
! */
! if (n_common_pathkeys == list_length(query_pathkeys))
! return n_common_pathkeys;
! else
! return 0;
}
}
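The decision rule of the rewritten pathkeys_useful_for_ordering() can be sketched with the list walks reduced to precomputed lengths (a hedged illustration; names and signature are mine, not the patch's):

```c
#include <stdbool.h>

/*
 * Sketch of pathkeys_useful_for_ordering(): with incremental sort enabled,
 * any common prefix of the path's pathkeys is useful; with it disabled,
 * usefulness is all-or-nothing.
 */
int useful_pathkeys(int n_common, int n_query, int n_path, bool enable_incsort)
{
    if (n_query == 0 || n_path == 0)
        return 0;               /* no ordering requested, or unordered path */
    if (enable_incsort)
        return n_common;        /* incremental sort handles the rest */
    return (n_common == n_query) ? n_common : 0;
}
```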
/*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1496,1502 ****
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
--- 1545,1551 ----
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 1e953b4..5625f2a
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 227,233 ****
bool *mergenullsfirst,
Plan *lefttree, Plan *righttree,
JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 227,233 ----
bool *mergenullsfirst,
Plan *lefttree, Plan *righttree,
JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static Plan *prepare_sort_from_pathkeys(
*** 242,251 ****
static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 242,253 ----
static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! int skipCols);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 423,428 ****
--- 425,431 ----
(GatherPath *) best_path);
break;
case T_Sort:
+ case T_IncrementalSort:
plan = (Plan *) create_sort_plan(root,
(SortPath *) best_path,
flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1067,1072 ****
--- 1070,1076 ----
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int n_common_pathkeys;
/* Build the child plan */
/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1101,1109 ****
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
--- 1105,1115 ----
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
+ n_common_pathkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1534,1539 ****
--- 1540,1546 ----
{
Sort *plan;
Plan *subplan;
+ int n_common_pathkeys;
/*
* We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1543,1549 ****
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
--- 1550,1560 ----
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! best_path->subpath->pathkeys);
!
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! n_common_pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1789,1795 ****
sort_plan = (Plan *)
make_sort_from_groupcols(groupClause,
new_grpColIdx,
! subplan);
agg_plan = (Plan *) make_agg(NIL,
NIL,
--- 1800,1807 ----
sort_plan = (Plan *)
make_sort_from_groupcols(groupClause,
new_grpColIdx,
! subplan,
! 0);
agg_plan = (Plan *) make_agg(NIL,
NIL,
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3621,3628 ****
*/
if (best_path->outersortkeys)
{
! Sort *sort = make_sort_from_pathkeys(outer_plan,
! best_path->outersortkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
--- 3633,3646 ----
*/
if (best_path->outersortkeys)
{
! Sort *sort;
! int n_common_pathkeys;
!
! n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! best_path->jpath.outerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3633,3640 ****
if (best_path->innersortkeys)
{
! Sort *sort = make_sort_from_pathkeys(inner_plan,
! best_path->innersortkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
--- 3651,3664 ----
if (best_path->innersortkeys)
{
! Sort *sort;
! int n_common_pathkeys;
!
! n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! best_path->jpath.innerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4686,4692 ****
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
--- 4710,4717 ----
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL, 0,
! lefttree->startup_cost,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5208,5220 ****
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node = makeNode(Sort);
! Plan *plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
--- 5233,5263 ----
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node;
! Plan *plan;
+ /* Always use regular sort node when enable_incrementalsort = false */
+ if (!enable_incrementalsort)
+ skipCols = 0;
+
+ if (skipCols == 0)
+ {
+ node = makeNode(Sort);
+ }
+ else
+ {
+ IncrementalSort *incrementalSort;
+
+ incrementalSort = makeNode(IncrementalSort);
+ node = &incrementalSort->sort;
+ incrementalSort->skipCols = skipCols;
+ }
+
+ plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass
*** 5546,5552 ****
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys)
{
int numsortkeys;
AttrNumber *sortColIdx;
--- 5589,5595 ----
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, int skipCols)
{
int numsortkeys;
AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree,
*** 5566,5572 ****
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5609,5615 ----
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5609,5615 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5652,5658 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, 0,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5630,5636 ****
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
--- 5673,5680 ----
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5663,5669 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5707,5713 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** is_projection_capable_plan(Plan *plan)
*** 6311,6316 ****
--- 6355,6361 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index c3fbf3c..5fe1235
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
#include "parser/parse_clause.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
#include "utils/syscache.h"
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index ca0ae78..6e4f223
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3497,3510 ****
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_partial_path || is_sorted)
{
/* Sort the cheapest partial path, if it isn't already */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3497,3510 ----
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_partial_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest partial path, if it isn't already */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3577,3590 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_path || is_sorted)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3577,3590 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4239,4251 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || is_sorted)
{
! if (!is_sorted)
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
--- 4239,4251 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || n_useful_pathkeys > 0)
{
! if (n_useful_pathkeys < list_length(root->sort_pathkeys))
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid
*** 5324,5331 ****
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL,
! seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
--- 5324,5332 ----
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL, 0,
! seqScanPath->startup_cost, seqScanPath->total_cost,
! rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index 07ddbcf..0534ac8
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 608,613 ****
--- 608,614 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 3eb2bb7..69ad4d3
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2692,2697 ****
--- 2692,2698 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_Gather:
case T_SetOp:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index 1389db1..0972d4b
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 963,969 ****
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 963,970 ----
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, 0,
! sorted_p.startup_cost, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 3248296..1faf100
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
}
/*
! * compare_path_fractional_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
--- 95,101 ----
}
/*
! * compare_fractional_path_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1293,1304 ****
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1293,1305 ----
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ int n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1312,1317 ****
--- 1313,1320 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->parent->tuples,
subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1548,1554 ****
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
--- 1551,1558 ----
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL, 0,
! subpath->startup_cost,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2399,2407 ****
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode = makeNode(SortPath);
- pathnode->path.pathtype = T_Sort;
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
--- 2403,2433 ----
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode;
! int n_common_pathkeys;
!
! if (enable_incrementalsort)
! n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! else
! n_common_pathkeys = 0;
!
! if (n_common_pathkeys == 0)
! {
! pathnode = makeNode(SortPath);
! pathnode->path.pathtype = T_Sort;
! }
! else
! {
! IncrementalSortPath *incpathnode;
!
! incpathnode = makeNode(IncrementalSortPath);
! pathnode = &incpathnode->spath;
! pathnode->path.pathtype = T_IncrementalSort;
! incpathnode->skipCols = n_common_pathkeys;
! }
!
! Assert(n_common_pathkeys < list_length(pathkeys));
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2415,2421 ****
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root, pathkeys,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
--- 2441,2449 ----
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root,
! pathkeys, n_common_pathkeys,
! subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2687,2693 ****
break;
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL,
0.0,
subpath->rows,
subpath->pathtarget->width,
--- 2715,2722 ----
break;
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL, 0,
! 0.0,
0.0,
subpath->rows,
subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index f9f18f2..9607889
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 276,282 ****
qstate->sortOperators,
qstate->sortCollations,
qstate->sortNullsFirsts,
! work_mem, false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
--- 276,282 ----
qstate->sortOperators,
qstate->sortCollations,
qstate->sortNullsFirsts,
! work_mem, false, false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index 8b05e8f..ab66784
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3521,3526 ****
--- 3521,3562 ----
}
/*
+ * estimate_pathkeys_groups - Estimate the number of groups the dataset is
+ * divided into by the pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * that the first i pathkeys divide the dataset into. This is a convenience
+ * wrapper over estimate_num_groups().
+ */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ ListCell *l;
+ List *groupExprs = NIL;
+ double *result;
+ int i;
+
+ /*
+ * Get number of groups for each prefix of pathkeys.
+ */
+ i = 0;
+ result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ groupExprs = lappend(groupExprs, member->em_expr);
+
+ result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ i++;
+ }
+
+ return result;
+ }
+
+ /*
* Estimate hash bucketsize fraction (ie, number of entries in a bucket
* divided by total tuples in relation) if the specified expression is used
* as a hash key.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 0707f66..9e00658
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 857,862 ****
--- 857,871 ----
NULL, NULL, NULL
},
{
+ {"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of incremental sort steps."),
+ NULL
+ },
+ &enable_incrementalsort,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of hashed aggregation plans."),
NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index e1e692d..af93ae4
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 281,286 ****
--- 281,291 ----
int64 allowedMem; /* total memory allowed, in bytes */
int maxTapes; /* number of tapes (Knuth's T) */
int tapeRange; /* maxTapes-1 (Knuth's P) */
+ TupSortStatus maxStatus; /* maximum status reached between sort groups */
+ int64 maxMem; /* maximum amount of memory used between
+ sort groups */
+ bool maxMemOnDisk; /* is maxMem value for on-disk memory */
+ MemoryContext maincontext;
MemoryContext sortcontext; /* memory context holding most sort data */
MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 633,638 ****
--- 638,646 ----
static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+
/*
* Special versions of qsort just for SortTuple objects. qsort_tuple() sorts
*************** static Tuplesortstate *
*** 667,685 ****
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Create a working memory context for this sort operation. All data
! * needed by the sort will live inside this context.
*/
! sortcontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
--- 675,704 ----
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
+ MemoryContext maincontext;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Memory context surviving tuplesort_reset. This memory context holds
! * data which is useful to keep while sorting multiple similar batches.
*/
! maincontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
+ * Create a working memory context for one sort operation. The content of
+ * this context is deleted by tuplesort_reset.
+ */
+ sortcontext = AllocSetContextCreate(maincontext,
+ "TupleSort sort",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+
+ /*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 696,702 ****
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(sortcontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
--- 715,721 ----
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(maincontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
*************** tuplesort_begin_common(int workMem, bool
*** 714,719 ****
--- 733,739 ----
state->availMem = state->allowedMem;
state->sortcontext = sortcontext;
state->tuplecontext = tuplecontext;
+ state->maincontext = maincontext;
state->tapeset = NULL;
state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 754,766 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
AssertArg(nkeys > 0);
--- 774,787 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
AssertArg(nkeys > 0);
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 802,808 ****
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0);
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
--- 823,829 ----
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0) && !skipAbbrev;
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 833,839 ****
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 854,860 ----
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 924,930 ****
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 945,951 ----
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 997,1003 ****
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 1018,1024 ----
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 1034,1040 ****
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 1055,1061 ----
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1145,1160 ****
}
/*
! * tuplesort_end
! *
! * Release resources and clean up.
*
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
*/
! void
! tuplesort_end(Tuplesortstate *state)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1166,1177 ----
}
/*
! * tuplesort_free
*
! * Internal routine for freeing resources of tuplesort.
*/
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1213,1219 ****
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! MemoryContextDelete(state->sortcontext);
}
/*
--- 1230,1327 ----
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! if (delete)
! {
! MemoryContextDelete(state->maincontext);
! }
! else
! {
! MemoryContextResetOnly(state->sortcontext);
! MemoryContextResetOnly(state->tuplecontext);
! }
! }
!
! /*
! * tuplesort_end
! *
! * Release resources and clean up.
! *
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
! */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! tuplesort_free(state, true);
! }
!
! /*
! * tuplesort_updatemax
! *
! * Update maximum resource usage statistics.
! */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! int64 memUsed;
! bool memUsedOnDisk;
!
! /*
! * Note: it might seem we should provide both memory and disk usage for a
! * disk-based sort. However, the current code doesn't track memory space
! * accurately once we have begun to return tuples to the caller (since we
! * don't account for pfree's the caller is expected to do), so we cannot
! * rely on availMem in a disk sort. This does not seem worth the overhead
! * to fix. Is it worth creating an API for the memory context code to
! * tell us how much is actually used in sortcontext?
! */
! if (state->tapeset)
! {
! memUsedOnDisk = true;
! memUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! }
! else
! {
! memUsedOnDisk = false;
! memUsed = state->allowedMem - state->availMem;
! }
!
! state->maxStatus = Max(state->maxStatus, state->status);
! if (memUsed > state->maxMem)
! {
! state->maxMem = memUsed;
! state->maxMemOnDisk = memUsedOnDisk;
! }
! }
!
! /*
! * tuplesort_reset
! *
! * Reset the tuplesort. Reset all the data in the tuplesort, but leave the
! * meta-information in. After tuplesort_reset, the tuplesort is ready to
! * start a new sort. This avoids recreating the tuplesort (and saves
! * resources) when sorting multiple small batches.
! */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! tuplesort_updatemax(state);
! tuplesort_free(state, false);
! state->status = TSS_INITIAL;
! state->memtupcount = 0;
! state->boundUsed = false;
! state->tapeset = NULL;
! state->currentRun = 0;
! state->result_tape = -1;
! state->bounded = false;
! state->availMem = state->allowedMem;
! state->lastReturnedTuple = NULL;
! state->slabAllocatorUsed = false;
! state->slabMemoryBegin = NULL;
! state->slabMemoryEnd = NULL;
! state->slabFreeHead = NULL;
! USEMEM(state, GetMemoryChunkSpace(state->memtuples));
}
/*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 3219,3245 ****
const char **spaceType,
long *spaceUsed)
{
! /*
! * Note: it might seem we should provide both memory and disk usage for a
! * disk-based sort. However, the current code doesn't track memory space
! * accurately once we have begun to return tuples to the caller (since we
! * don't account for pfree's the caller is expected to do), so we cannot
! * rely on availMem in a disk sort. This does not seem worth the overhead
! * to fix. Is it worth creating an API for the memory context code to
! * tell us how much is actually used in sortcontext?
! */
! if (state->tapeset)
! {
*spaceType = "Disk";
- *spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- }
else
- {
*spaceType = "Memory";
! *spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! }
! switch (state->status)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
--- 3327,3341 ----
const char **spaceType,
long *spaceUsed)
{
! tuplesort_updatemax(state);
!
! if (state->maxMemOnDisk)
*spaceType = "Disk";
else
*spaceType = "Memory";
! *spaceUsed = (state->maxMem + 1023) / 1024;
! switch (state->maxStatus)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...09c5a27
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,25 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+
+ #include "nodes/execnodes.h"
+
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node,
+ EState *estate, int eflags);
+ extern TupleTableSlot *ExecIncrementalSort(IncrementalSortState *node);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+ #endif /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index 6332ea0..0d63c65
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1817,1822 ****
--- 1817,1836 ----
Tuplestorestate *tuplestorestate;
} MaterialState;
+
+ /* ----------------
+ * When sorting by multiple keys, the input dataset may already be
+ * presorted by some prefix of those keys. We call these "skip keys".
+ * SkipKeyData represents information about one such key.
+ * ----------------
+ */
+ typedef struct SkipKeyData
+ {
+ FmgrInfo flinfo; /* comparison function info */
+ FunctionCallInfoData fcinfo; /* comparison function call info */
+ OffsetNumber attno; /* attribute number in tuple */
+ } SkipKeyData;
+
/* ----------------
* SortState information
* ----------------
*************** typedef struct SortState
*** 1833,1838 ****
--- 1847,1872 ----
void *tuplesortstate; /* private state of tuplesort.c */
} SortState;
+ /* ----------------
+ * IncrementalSortState information
+ * ----------------
+ */
+ typedef struct IncrementalSortState
+ {
+ ScanState ss; /* its first field is NodeTag */
+ bool bounded; /* is the result set bounded? */
+ int64 bound; /* if bounded, how many tuples are needed */
+ bool sort_Done; /* sort completed yet? */
+ bool finished; /* fetching tuples from outer node
+ is finished? */
+ bool bounded_Done; /* value of bounded we did the sort with */
+ int64 bound_Done; /* value of bound we did the sort with */
+ void *tuplesortstate; /* private state of tuplesort.c */
+ SkipKeyData *skipKeys; /* keys the dataset is presorted by */
+ int64 groupsCount; /* number of groups with equal skip keys */
+ TupleTableSlot *prevSlot; /* slot for previous tuple from outer node */
+ } IncrementalSortState;
+
/* ---------------------
* GroupState information
* ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index 95dd8ba..24b49a7
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 71,76 ****
--- 71,77 ----
T_HashJoin,
T_Material,
T_Sort,
+ T_IncrementalSort,
T_Group,
T_Agg,
T_WindowAgg,
*************** typedef enum NodeTag
*** 120,125 ****
--- 121,127 ----
T_HashJoinState,
T_MaterialState,
T_SortState,
+ T_IncrementalSortState,
T_GroupState,
T_AggState,
T_WindowAggState,
*************** typedef enum NodeTag
*** 249,254 ****
--- 251,257 ----
T_ProjectionPath,
T_ProjectSetPath,
T_SortPath,
+ T_IncrementalSortPath,
T_GroupPath,
T_UpperUniquePath,
T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index f72f7a8..2a776ee
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 699,704 ****
--- 699,715 ----
bool *nullsFirst; /* NULLS FIRST/LAST directions */
} Sort;
+
+ /* ----------------
+ * incremental sort node
+ * ----------------
+ */
+ typedef struct IncrementalSort
+ {
+ Sort sort;
+ int skipCols; /* number of presorted columns */
+ } IncrementalSort;
+
/* ---------------
* group node -
* Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index f7ac6f6..b0ab815
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1331,1336 ****
--- 1331,1346 ----
} SortPath;
/*
+ * IncrementalSortPath
+ */
+ typedef struct IncrementalSortPath
+ {
+ SortPath spath;
+ int skipCols;
+ } IncrementalSortPath;
+
+
+ /*
* GroupPath represents grouping (of presorted input)
*
* groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 72200fa..09067f4
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
extern bool enable_bitmapscan;
extern bool enable_tidscan;
extern bool enable_sort;
+ extern bool enable_incrementalsort;
extern bool enable_hashagg;
extern bool enable_nestloop;
extern bool enable_material;
*************** extern void cost_ctescan(Path *path, Pla
*** 95,102 ****
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
--- 96,104 ----
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index ebda308..3271203
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 180,185 ****
--- 180,186 ----
extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
Relids required_outer,
CostSelector cost_criterion);
*************** extern List *select_outer_pathkeys_for_m
*** 216,221 ****
--- 217,223 ----
extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
List *mergeclauses,
List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
extern List *truncate_useless_pathkeys(PlannerInfo *root,
RelOptInfo *rel,
List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 9f9d2dc..b8884b6
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 204,209 ****
--- 204,212 ----
extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows, List **pgset);
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ double tuples);
+
extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
double nbuckets);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 5b3f475..616f9f5
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 62,68 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
--- 62,69 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 104,109 ****
--- 105,112 ----
extern void tuplesort_end(Tuplesortstate *state);
+ extern void tuplesort_reset(Tuplesortstate *state);
+
extern void tuplesort_get_stats(Tuplesortstate *state,
const char **sortMethod,
const char **spaceType,
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort
*** 19,27 ****
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Sort
Sort Key: id, data
! -> Seq Scan on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
--- 19,28 ----
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Incremental Sort
Sort Key: id, data
! Presorted Key: id
! -> Index Scan using test_dc_pkey on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
new file mode 100644
index 0ff8062..3ad5eb3
*** a/src/test/regress/expected/aggregates.out
--- b/src/test/regress/expected/aggregates.out
*************** group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.y,t
*** 996,1010 ****
explain (costs off) select t1.*,t2.x,t2.z
from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
! QUERY PLAN
! -------------------------------------------------------
! HashAggregate
Group Key: t1.a, t1.b, t2.x, t2.z
! -> Merge Join
! Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
! -> Index Scan using t1_pkey on t1
! -> Index Scan using t2_pkey on t2
! (6 rows)
-- Cannot optimize when PK is deferrable
explain (costs off) select * from t3 group by a,b,c;
--- 996,1013 ----
explain (costs off) select t1.*,t2.x,t2.z
from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
! QUERY PLAN
! -------------------------------------------------------------
! Group
Group Key: t1.a, t1.b, t2.x, t2.z
! -> Incremental Sort
! Sort Key: t1.a, t1.b, t2.z
! Presorted Key: t1.a, t1.b
! -> Merge Join
! Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
! -> Index Scan using t1_pkey on t1
! -> Index Scan using t2_pkey on t2
! (9 rows)
-- Cannot optimize when PK is deferrable
explain (costs off) select * from t3 group by a,b,c;
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index a8c8b28..2925e55
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE: drop cascades to table matest1
*** 1448,1453 ****
--- 1448,1454 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1588,1596 ****
--- 1589,1633 ----
{3,7,8,10,13,13,16,18,19,22}
(3 rows)
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ QUERY PLAN
+ -------------------------------------------------------------------------
+ Merge Append
+ Sort Key: tenk1.thousand, tenk1.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1
+ -> Incremental Sort
+ Sort Key: tenk1_1.thousand, tenk1_1.thousand
+ Presorted Key: tenk1_1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ QUERY PLAN
+ -------------------------------------------------------------
+ Merge Append
+ Sort Key: a.thousand, a.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 a
+ -> Incremental Sort
+ Sort Key: b.unique2, b.unique2
+ Presorted Key: b.unique2
+ -> Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index d48abd7..119f7d5
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select count(*) >= 0 as ok from pg_prepa
*** 70,89 ****
-- This is to record the prevailing planner enable_foo settings during
-- a regression test run.
select name, setting from pg_settings where name like 'enable%';
! name | setting
! ----------------------+---------
! enable_bitmapscan | on
! enable_hashagg | on
! enable_hashjoin | on
! enable_indexonlyscan | on
! enable_indexscan | on
! enable_material | on
! enable_mergejoin | on
! enable_nestloop | on
! enable_seqscan | on
! enable_sort | on
! enable_tidscan | on
! (11 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
--- 70,90 ----
-- This is to record the prevailing planner enable_foo settings during
-- a regression test run.
select name, setting from pg_settings where name like 'enable%';
! name | setting
! ------------------------+---------
! enable_bitmapscan | on
! enable_hashagg | on
! enable_hashjoin | on
! enable_incrementalsort | on
! enable_indexonlyscan | on
! enable_indexscan | on
! enable_material | on
! enable_mergejoin | on
! enable_nestloop | on
! enable_seqscan | on
! enable_sort | on
! enable_tidscan | on
! (12 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index a8b7eb1..39dd786
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 498,503 ****
--- 498,504 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
*************** SELECT
*** 559,567 ****
--- 560,585 ----
ORDER BY f.i LIMIT 10)
FROM generate_series(1, 3) g(i);
+ set enable_incrementalsort = on;
+
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
On Mon, Feb 27, 2017 at 8:29 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
This patch needs to be rebased.
1. It fails while applying as below
patching file src/test/regress/expected/sysviews.out
Hunk #1 FAILED at 70.
1 out of 1 hunk FAILED -- saving rejects to file
src/test/regress/expected/sysviews.out.rej
patching file src/test/regress/sql/inherit.sql
2. Also, there are compilation errors due to new commits.
-fwrapv -fexcess-precision=standard -O2 -I../../../../src/include
-D_GNU_SOURCE -c -o createplan.o createplan.c
createplan.c: In function ‘create_gather_merge_plan’:
createplan.c:1510:11: warning: passing argument 3 of ‘make_sort’ makes
integer from pointer without a cast [enabled by default]
gm_plan->nullsFirst);
^
createplan.c:235:14: note: expected ‘int’ but argument is of type ‘AttrNumber *’
static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
^
createplan.c:1510:11: warning: passing argument 4 of ‘make_sort’ from
incompatible pointer type [enabled by default]
gm_plan->nullsFirst);
--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Dear Mithun,
On Mon, Mar 20, 2017 at 10:01 AM, Mithun Cy <mithun.cy@enterprisedb.com>
wrote:
On Mon, Feb 27, 2017 at 8:29 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
This patch needs to be rebased.
1. It fails while applying as below
patching file src/test/regress/expected/sysviews.out
Hunk #1 FAILED at 70.
1 out of 1 hunk FAILED -- saving rejects to file
src/test/regress/expected/sysviews.out.rej
patching file src/test/regress/sql/inherit.sql
2. Also, there are compilation errors due to new commits.
-fwrapv -fexcess-precision=standard -O2 -I../../../../src/include
-D_GNU_SOURCE -c -o createplan.o createplan.c
createplan.c: In function ‘create_gather_merge_plan’:
createplan.c:1510:11: warning: passing argument 3 of ‘make_sort’ makes
integer from pointer without a cast [enabled by default]
gm_plan->nullsFirst);
^
createplan.c:235:14: note: expected ‘int’ but argument is of type
‘AttrNumber *’
static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
^
createplan.c:1510:11: warning: passing argument 4 of ‘make_sort’ from
incompatible pointer type [enabled by default]
gm_plan->nullsFirst);
Thank you for the report.
Please, find rebased patch in the attachment.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
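For readers skimming the thread, the core technique the patch implements can be sketched as a toy Python function (an illustration only, with made-up names; the patch itself does this in C inside the executor via tuplesort): the input is already ordered on a prefix of the sort keys, so each run of rows sharing the same prefix values can be sorted independently and emitted before the rest of the input is read.

```python
def incremental_sort(rows, presorted_keys, sort_keys):
    """Sort `rows` on `sort_keys`, given that they already arrive
    sorted on `presorted_keys` (a prefix of `sort_keys`).  Rows with
    equal prefix values form a group; each group is sorted on its own."""
    out = []
    group = []
    prev_prefix = None
    for row in rows:
        prefix = tuple(row[k] for k in presorted_keys)
        if prev_prefix is not None and prefix != prev_prefix:
            # Prefix changed: the previous group is complete, so sort
            # and emit it.  Within a group the prefix is constant, so
            # sorting on the full key list is safe.
            group.sort(key=lambda r: tuple(r[k] for k in sort_keys))
            out.extend(group)
            group = []
        group.append(row)
        prev_prefix = prefix
    # Flush the final group.
    group.sort(key=lambda r: tuple(r[k] for k in sort_keys))
    out.extend(group)
    return out

# Input presorted on column 0; we want it sorted on columns (0, 1).
rows = [(1, 'b'), (1, 'a'), (2, 'c'), (2, 'a')]
print(incremental_sort(rows, [0], [0, 1]))
```

Because each group is typically much smaller than the whole input, groups can often be sorted in memory, and the first output rows appear before the last input rows have been fetched.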
Attachments:
incremental-sort-3.patch (application/octet-stream)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index 059c5c3..185a0da
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1913,1951 ****
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ---------------------------------------------------------------------
Limit
! Output: t1.c1, t2.c1
-> Sort
! Output: t1.c1, t2.c1
! Sort Key: t1.c1, t2.c1
-> Nested Loop
! Output: t1.c1, t2.c1
-> Foreign Scan on public.ft1 t1
! Output: t1.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-> Materialize
! Output: t2.c1
-> Foreign Scan on public.ft2 t2
! Output: t2.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! c1 | c1
! ----+-----
! 1 | 101
! 1 | 102
! 1 | 103
! 1 | 104
! 1 | 105
! 1 | 106
! 1 | 107
! 1 | 108
! 1 | 109
! 1 | 110
(10 rows)
-- different server, not pushed down. No result expected.
--- 1913,1951 ----
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ------------------------------------------------------------------
Limit
! Output: t1.c3, t2.c3
-> Sort
! Output: t1.c3, t2.c3
! Sort Key: t1.c3, t2.c3
-> Nested Loop
! Output: t1.c3, t2.c3
-> Foreign Scan on public.ft1 t1
! Output: t1.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
-> Materialize
! Output: t2.c3
-> Foreign Scan on public.ft2 t2
! Output: t2.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! c3 | c3
! -------+-------
! 00001 | 00101
! 00001 | 00102
! 00001 | 00103
! 00001 | 00104
! 00001 | 00105
! 00001 | 00106
! 00001 | 00107
! 00001 | 00108
! 00001 | 00109
! 00001 | 00110
(10 rows)
-- different server, not pushed down. No result expected.
*************** select c2/2, sum(c2) * (c2/2) from ft1 g
*** 2487,2504 ****
-- Aggregates in subquery are pushed down.
explain (verbose, costs off)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
! QUERY PLAN
! ----------------------------------------------------------------------------------------------------------
Aggregate
Output: count(ft1.c2), sum(ft1.c2)
! -> Sort
Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
Sort Key: ft1.c2, (sum(ft1.c1))
! -> Foreign Scan
! Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
! Relations: Aggregate on (public.ft1)
! Remote SQL: SELECT c2, sum("C 1"), sqrt("C 1") FROM "S 1"."T 1" GROUP BY c2, (sqrt("C 1"))
! (9 rows)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
count | sum
--- 2487,2507 ----
-- Aggregates in subquery are pushed down.
explain (verbose, costs off)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
! QUERY PLAN
! --------------------------------------------------------------------------------------------------------------------------
Aggregate
Output: count(ft1.c2), sum(ft1.c2)
! -> Incremental Sort
Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
Sort Key: ft1.c2, (sum(ft1.c1))
! Presorted Key: ft1.c2
! -> GroupAggregate
! Output: ft1.c2, sum(ft1.c1), (sqrt((ft1.c1)::double precision))
! Group Key: ft1.c2, sqrt((ft1.c1)::double precision)
! -> Foreign Scan on public.ft1
! Output: ft1.c2, sqrt((ft1.c1)::double precision), ft1.c1
! Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" ORDER BY c2 ASC NULLS LAST, sqrt("C 1") ASC NULLS LAST
! (12 rows)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
count | sum
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index 8f3edc1..a13d556
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 479,486 ****
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 479,486 ----
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index b379b67..3dfe6a5
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3538,3543 ****
--- 3538,3557 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+ <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of incremental sort
+ steps. The default is <literal>on</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
<term><varname>enable_indexscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index c9b55ea..036a410
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual,
*** 79,84 ****
--- 79,86 ----
ExplainState *es);
static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 92,98 ****
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 94,100 ----
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 100,105 ****
--- 102,109 ----
static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 993,998 ****
--- 997,1005 ----
case T_Sort:
pname = sname = "Sort";
break;
+ case T_IncrementalSort:
+ pname = sname = "Incremental Sort";
+ break;
case T_Group:
pname = sname = "Group";
break;
*************** ExplainNode(PlanState *planstate, List *
*** 1561,1566 ****
--- 1568,1579 ----
show_sort_keys(castNode(SortState, planstate), ancestors, es);
show_sort_info(castNode(SortState, planstate), es);
break;
+ case T_IncrementalSort:
+ show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ ancestors, es);
+ show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ es);
+ break;
case T_MergeAppend:
show_merge_append_keys(castNode(MergeAppendState, planstate),
ancestors, es);
*************** static void
*** 1886,1900 ****
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
* Likewise, for a MergeAppend node.
*/
static void
--- 1899,1936 ----
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
+ int skipCols;
+
+ if (IsA(plan, IncrementalSort))
+ skipCols = ((IncrementalSort *) plan)->skipCols;
+ else
+ skipCols = 0;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, skipCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es)
+ {
+ IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+ show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ plan->sort.numCols, plan->skipCols,
+ plan->sort.sortColIdx,
+ plan->sort.sortOperators, plan->sort.collations,
+ plan->sort.nullsFirst,
+ ancestors, es);
+ }
+
+ /*
* Likewise, for a MergeAppend node.
*/
static void
*************** show_merge_append_keys(MergeAppendState
*** 1904,1910 ****
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
--- 1940,1946 ----
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, 0, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1928,1934 ****
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
--- 1964,1970 ----
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
*************** show_grouping_set_keys(PlanState *planst
*** 1984,1990 ****
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
--- 2020,2026 ----
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, 0, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 2041,2047 ****
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
--- 2077,2083 ----
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2054,2066 ****
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
--- 2090,2103 ----
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
+ List *resultPresorted = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2100,2108 ****
--- 2137,2149 ----
nullsFirst[keyno]);
/* Emit one property-list item per sort key */
result = lappend(result, pstrdup(sortkeybuf.data));
+ if (keyno < nPresortedKeys)
+ resultPresorted = lappend(resultPresorted, exprstr);
}
ExplainPropertyList(qlabel, result, es);
+ if (nPresortedKeys > 0)
+ ExplainPropertyList("Presorted Key", resultPresorted, es);
}
/*
*************** show_sort_info(SortState *sortstate, Exp
*** 2260,2265 ****
--- 2301,2343 ----
}
/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es)
+ {
+ if (es->analyze && incrsortstate->sort_Done &&
+ incrsortstate->tuplesortstate != NULL)
+ {
+ Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+
+ tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",
+ sortMethod, spaceType, spaceUsed);
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort groups: %ld\n",
+ incrsortstate->groupsCount);
+ }
+ else
+ {
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups",
+ incrsortstate->groupsCount, es);
+ }
+ }
+ }
+
+ /*
* Show information on hash buckets/batches.
*/
static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index d281906..1b97d1c
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execGroup
*** 23,30 ****
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
! nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
nodeTableFuncscan.o
--- 23,31 ----
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
! nodeSort.o nodeIncrementalSort.o \
! nodeUnique.o nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
nodeTableFuncscan.o
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index 5d59f95..e04175a
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 30,35 ****
--- 30,36 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 243,248 ****
--- 244,253 ----
ExecReScanSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecReScanIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecReScanGroup((GroupState *) node);
break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 514,521 ****
--- 519,530 ----
case T_CteScan:
case T_Material:
case T_Sort:
+ /* these don't evaluate tlist */
return true;
+ case T_IncrementalSort:
+ return false;
+
case T_LockRows:
case T_Limit:
return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index 80c77ad..1fa1de4
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 93,98 ****
--- 93,99 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 302,307 ****
--- 303,313 ----
estate, eflags);
break;
+ case T_IncrementalSort:
+ result = (PlanState *) ExecInitIncrementalSort(
+ (IncrementalSort *) node, estate, eflags);
+ break;
+
case T_Group:
result = (PlanState *) ExecInitGroup((Group *) node,
estate, eflags);
*************** ExecProcNode(PlanState *node)
*** 521,526 ****
--- 527,536 ----
result = ExecSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ result = ExecIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
result = ExecGroup((GroupState *) node);
break;
*************** ExecEndNode(PlanState *node)
*** 789,794 ****
--- 799,808 ----
ExecEndSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecEndIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecEndGroup((GroupState *) node);
break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index 3207ee4..aa9dfcc
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 559,564 ****
--- 559,565 ----
sortnode->collations,
sortnode->nullsFirst,
work_mem,
+ false,
false);
}
*************** initialize_aggregate(AggState *aggstate,
*** 637,643 ****
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false);
}
/*
--- 638,644 ----
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false, false);
}
/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...04576c6
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,485 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ * Routines to handle incremental sorting of relations.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+
+ /*
+ * Check whether the first "skipCols" sort column values are equal.
+ */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ TupleTableSlot *b)
+ {
+ int n, i;
+
+ Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+ n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+ for (i = 0; i < n; i++)
+ {
+ Datum datumA, datumB, result;
+ bool isnullA, isnullB;
+ AttrNumber attno = node->skipKeys[i].attno;
+ SkipKeyData *key;
+
+ datumA = slot_getattr(a, attno, &isnullA);
+ datumB = slot_getattr(b, attno, &isnullB);
+
+ /* Special case for NULL-vs-NULL, else use standard comparison */
+ if (isnullA || isnullB)
+ {
+ if (isnullA == isnullB)
+ continue;
+ else
+ return false;
+ }
+
+ key = &node->skipKeys[i];
+
+ key->fcinfo.arg[0] = datumA;
+ key->fcinfo.arg[1] = datumB;
+
+ /* just for paranoia's sake, we reset isnull each time */
+ key->fcinfo.isnull = false;
+
+ result = FunctionCallInvoke(&key->fcinfo);
+
+ /* Check for null result, since caller is clearly not expecting one */
+ if (key->fcinfo.isnull)
+ elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+ if (!DatumGetBool(result))
+ return false;
+ }
+ return true;
+ }
+
+ /*
+ * Prepare information for skipKeys comparison.
+ */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ int skipCols,
+ i;
+
+ Assert(IsA(plannode, IncrementalSort));
+ skipCols = plannode->skipCols;
+
+ node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+ for (i = 0; i < skipCols; i++)
+ {
+ Oid equalityOp, equalityFunc;
+ SkipKeyData *key;
+
+ key = &node->skipKeys[i];
+ key->attno = plannode->sort.sortColIdx[i];
+
+ equalityOp = get_equality_op_for_ordering_op(
+ plannode->sort.sortOperators[i], NULL);
+ if (!OidIsValid(equalityOp))
+ elog(ERROR, "missing equality operator for ordering operator %u",
+ plannode->sort.sortOperators[i]);
+
+ equalityFunc = get_opcode(equalityOp);
+ if (!OidIsValid(equalityFunc))
+ elog(ERROR, "missing function for operator %u", equalityOp);
+
+ /* Lookup the equality function */
+ fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+ /* We can initialize the callinfo just once and re-use it */
+ InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ plannode->sort.collations[i], NULL, NULL);
+ key->fcinfo.argnull[0] = false;
+ key->fcinfo.argnull[1] = false;
+ }
+ }
+
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSort
+ *
+ * Assuming that the outer subtree returns tuples presorted by some
+ * prefix of the target sort columns, perform an incremental sort.
+ * Fetch groups of tuples whose prefix sort columns are equal and sort
+ * each group using tuplesort. This approach avoids sorting the whole
+ * dataset at once. Besides taking less memory and being faster, it
+ * allows tuples to be returned before the full dataset has been
+ * fetched from the outer subtree.
+ *
+ * Conditions:
+ * -- none.
+ *
+ * Initial States:
+ * -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+ TupleTableSlot *
+ ExecIncrementalSort(IncrementalSortState *node)
+ {
+ EState *estate;
+ ScanDirection dir;
+ Tuplesortstate *tuplesortstate;
+ TupleTableSlot *slot;
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ PlanState *outerNode;
+ int skipCols;
+ TupleDesc tupDesc;
+ int64 nTuples = 0;
+
+ skipCols = plannode->skipCols;
+
+ /*
+ * get state info from node
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "entering routine");
+
+ estate = node->ss.ps.state;
+ dir = estate->es_direction;
+ tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+ /*
+ * Return next tuple from sorted set if any.
+ */
+ if (node->sort_Done)
+ {
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ if (tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ slot, NULL) || node->finished)
+ return slot;
+ }
+
+ /*
+ * If first time through or the previous group is exhausted, read the
+ * next group of tuples from the outer plan and pass them to
+ * tuplesort.c.
+ */
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "sorting subplan");
+
+ /*
+ * Want to scan subplan in the forward direction while creating the
+ * sorted data.
+ */
+ estate->es_direction = ForwardScanDirection;
+
+ /*
+ * Initialize tuplesort module.
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "calling tuplesort_begin");
+
+ outerNode = outerPlanState(node);
+ tupDesc = ExecGetResultType(outerNode);
+
+ if (node->tuplesortstate == NULL)
+ {
+ /*
+ * We are going to process the first group of presorted data.
+ * Initialize the support structures used by cmpSortSkipCols() to
+ * compare the already-sorted columns.
+ */
+ prepareSkipCols(node);
+
+ /*
+ * Only pass on the remaining, unsorted columns. Skip abbreviated
+ * keys for incremental sort: we are unlikely to have huge groups,
+ * so using abbreviated keys would likely be a waste of time.
+ */
+ tuplesortstate = tuplesort_begin_heap(
+ tupDesc,
+ plannode->sort.numCols - skipCols,
+ &(plannode->sort.sortColIdx[skipCols]),
+ &(plannode->sort.sortOperators[skipCols]),
+ &(plannode->sort.collations[skipCols]),
+ &(plannode->sort.nullsFirst[skipCols]),
+ work_mem,
+ false,
+ true);
+ node->tuplesortstate = (void *) tuplesortstate;
+ node->groupsCount++;
+ }
+ else
+ {
+ /* Next group of presorted data */
+ tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ node->groupsCount++;
+ }
+
+ /* Calculate remaining bound for bounded sort */
+ if (node->bounded)
+ tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+ /*
+ * Put the next group of tuples, i.e. those whose skipCols sort values
+ * are all equal, into the tuplesort.
+ */
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ /* Put next group of presorted data to the tuplesort */
+ if (node->prevSlot->tts_isempty)
+ {
+ /* First tuple */
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+ else
+ {
+ ExecCopySlot(node->prevSlot, slot);
+ }
+ }
+ else
+ {
+ /* Put previous tuple into tuplesort */
+ tuplesort_puttupleslot(tuplesortstate, node->prevSlot);
+ nTuples++;
+
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+ else
+ {
+ bool cmp;
+ cmp = cmpSortSkipCols(node, node->prevSlot, slot);
+
+ /* Replace previous tuple with current one */
+ ExecCopySlot(node->prevSlot, slot);
+
+ /*
+ * When the skipCols are not equal, the group of presorted
+ * data is finished.
+ */
+ if (!cmp)
+ break;
+ }
+ }
+ }
+
+ /*
+ * Complete the sort.
+ */
+ tuplesort_performsort(tuplesortstate);
+
+ /*
+ * restore to user specified direction
+ */
+ estate->es_direction = dir;
+
+ /*
+ * finally set the sorted flag to true
+ */
+ node->sort_Done = true;
+ node->bounded_Done = node->bounded;
+
+ /*
+ * Adjust bound_Done with number of tuples we've actually sorted.
+ */
+ if (node->bounded)
+ {
+ if (node->finished)
+ node->bound_Done = node->bound;
+ else
+ node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ }
+
+ SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "retrieving tuple from tuplesort");
+
+ /*
+ * Get the first or next tuple from tuplesort. Returns NULL if no more
+ * tuples.
+ */
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ slot, NULL);
+ return slot;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecInitIncrementalSort
+ *
+ * Creates the run-time state information for the sort node
+ * produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ IncrementalSortState *incrsortstate;
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "initializing sort node");
+
+ /*
+ * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+ * current group in the tuplesortstate.
+ */
+ Assert((eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) == 0);
+
+ /*
+ * create state structure
+ */
+ incrsortstate = makeNode(IncrementalSortState);
+ incrsortstate->ss.ps.plan = (Plan *) node;
+ incrsortstate->ss.ps.state = estate;
+
+ incrsortstate->bounded = false;
+ incrsortstate->sort_Done = false;
+ incrsortstate->finished = false;
+ incrsortstate->tuplesortstate = NULL;
+ incrsortstate->prevSlot = NULL;
+ incrsortstate->bound_Done = 0;
+ incrsortstate->groupsCount = 0;
+ incrsortstate->skipKeys = NULL;
+
+ /*
+ * Miscellaneous initialization
+ *
+ * Sort nodes don't initialize their ExprContexts because they never call
+ * ExecQual or ExecProject.
+ */
+
+ /*
+ * tuple table initialization
+ *
+ * sort nodes only return scan tuples from their sorted relation.
+ */
+ ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+
+ /*
+ * initialize child nodes
+ *
+ * We shield the child node from the need to support REWIND, BACKWARD, or
+ * MARK/RESTORE.
+ */
+ eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+ outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+ /*
+ * initialize tuple type. no need to initialize projection info because
+ * this node doesn't do projections.
+ */
+ ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+ /* make a standalone slot to store the previous tuple from the outer node */
+ incrsortstate->prevSlot = MakeSingleTupleTableSlot(
+ ExecGetResultType(outerPlanState(incrsortstate)));
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "sort node initialized");
+
+ return incrsortstate;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "shutting down sort node");
+
+ /*
+ * clean out the tuple table
+ */
+ ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ /* must drop standalone tuple slot from outer node */
+ ExecDropSingleTupleTableSlot(node->prevSlot);
+
+ /*
+ * Release tuplesort resources
+ */
+ if (node->tuplesortstate != NULL)
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+
+ /*
+ * shut down the subplan
+ */
+ ExecEndNode(outerPlanState(node));
+
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "sort node shutdown");
+ }
+
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ PlanState *outerPlan = outerPlanState(node);
+
+ /*
+ * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ * re-scan it at all.
+ */
+ if (!node->sort_Done)
+ return;
+
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+ /*
+ * If the subnode is to be rescanned, we forget previous sort results;
+ * we have to re-read the subplan and re-sort. Since incremental sort
+ * keeps only the current group in the tuplesort, we cannot simply
+ * rewind and rescan the sorted output.
+ */
+ node->sort_Done = false;
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+ node->bound_Done = 0;
+
+ /*
+ * if chgParam of subnode is not null then plan will be re-scanned by
+ * first ExecProcNode.
+ */
+ if (outerPlan->chgParam == NULL)
+ ExecReScan(outerPlan);
+ }
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 591a31a..cf228d6
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(SortState *node)
*** 89,95 ****
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
--- 89,96 ----
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess,
! false);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index 25fd051..f82f620
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 885,890 ****
--- 885,908 ----
/*
+ * CopySortFields
+ *
+ * This function copies the fields of the Sort node. It is used by
+ * all the copy functions for classes which inherit from Sort.
+ */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+ COPY_SCALAR_FIELD(numCols);
+ COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+
+ /*
* _copySort
*/
static Sort *
*************** _copySort(const Sort *from)
*** 895,907 ****
/*
* copy node superclass fields
*/
! CopyPlanFields((const Plan *) from, (Plan *) newnode);
! COPY_SCALAR_FIELD(numCols);
! COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
return newnode;
}
--- 913,941 ----
/*
* copy node superclass fields
*/
! CopySortFields(from, newnode);
! return newnode;
! }
!
!
! /*
! * _copyIncrementalSort
! */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! IncrementalSort *newnode = makeNode(IncrementalSort);
!
! /*
! * copy node superclass fields
! */
! CopySortFields((const Sort *) from, (Sort *) newnode);
!
! /*
! * copy remainder of node
! */
! COPY_SCALAR_FIELD(skipCols);
return newnode;
}
*************** copyObject(const void *from)
*** 4686,4691 ****
--- 4720,4728 ----
case T_Sort:
retval = _copySort(from);
break;
+ case T_IncrementalSort:
+ retval = _copyIncrementalSort(from);
+ break;
case T_Group:
retval = _copyGroup(from);
break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index 7418fbe..d78fd02
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 822,833 ****
}
static void
! _outSort(StringInfo str, const Sort *node)
{
int i;
- WRITE_NODE_TYPE("SORT");
-
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
--- 822,831 ----
}
static void
! _outSortInfo(StringInfo str, const Sort *node)
{
int i;
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 850,855 ****
--- 848,871 ----
}
static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ WRITE_NODE_TYPE("SORT");
+
+ _outSortInfo(str, node);
+ }
+
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ WRITE_NODE_TYPE("INCREMENTALSORT");
+
+ _outSortInfo(str, (const Sort *) node);
+
+ WRITE_INT_FIELD(skipCols);
+ }
+
+ static void
_outUnique(StringInfo str, const Unique *node)
{
int i;
*************** outNode(StringInfo str, const void *obj)
*** 3591,3596 ****
--- 3607,3615 ----
case T_Sort:
_outSort(str, obj);
break;
+ case T_IncrementalSort:
+ _outIncrementalSort(str, obj);
+ break;
case T_Unique:
_outUnique(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index d3bbc02..65f7ff0
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 2021,2032 ****
}
/*
! * _readSort
*/
! static Sort *
! _readSort(void)
{
! READ_LOCALS(Sort);
ReadCommonPlan(&local_node->plan);
--- 2021,2033 ----
}
/*
! * ReadCommonSort
! * Assign the basic stuff of all nodes that inherit from Sort
*/
! static void
! ReadCommonSort(Sort *local_node)
{
! READ_TEMP_LOCALS();
ReadCommonPlan(&local_node->plan);
*************** _readSort(void)
*** 2035,2040 ****
--- 2036,2067 ----
READ_OID_ARRAY(sortOperators, local_node->numCols);
READ_OID_ARRAY(collations, local_node->numCols);
READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+
+ /*
+ * _readSort
+ */
+ static Sort *
+ _readSort(void)
+ {
+ READ_LOCALS_NO_FIELDS(Sort);
+
+ ReadCommonSort(local_node);
+
+ READ_DONE();
+ }
+
+ /*
+ * _readIncrementalSort
+ */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ READ_LOCALS(IncrementalSort);
+
+ ReadCommonSort(&local_node->sort);
+
+ READ_INT_FIELD(skipCols);
READ_DONE();
}
*************** parseNodeString(void)
*** 2587,2592 ****
--- 2614,2621 ----
return_value = _readMaterial();
else if (MATCH("SORT", 4))
return_value = _readSort();
+ else if (MATCH("INCREMENTALSORT", 15))
+ return_value = _readIncrementalSort();
else if (MATCH("GROUP", 5))
return_value = _readGroup();
else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index 43bfd23..a9c9005
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3209,3214 ****
--- 3209,3218 ----
ptype = "Sort";
subpath = ((SortPath *) path)->subpath;
break;
+ case T_IncrementalSortPath:
+ ptype = "IncrementalSort";
+ subpath = ((SortPath *) path)->subpath;
+ break;
case T_GroupPath:
ptype = "Group";
subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index a129d1e..5af59f1
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
+ bool enable_incrementalsort = true;
bool enable_hashagg = true;
bool enable_nestloop = true;
bool enable_material = true;
*************** cost_recursive_union(Path *runion, Path
*** 1563,1568 ****
--- 1564,1576 ----
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
+ * The sort can be either a full sort of the relation or an incremental
+ * sort when the data is already presorted by some of the required
+ * pathkeys. In the latter case we estimate the number of groups the
+ * source data is divided into by the presorted pathkeys, and then
+ * estimate the cost of sorting each individual group, assuming the data
+ * is divided among the groups uniformly. Also, if LIMIT is specified,
+ * we only have to pull from the source and sort some of the groups.
+ *
* If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path
*** 1589,1595 ****
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
--- 1597,1604 ----
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_startup_cost' is the startup cost for reading the input data
! * 'input_total_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path
*** 1605,1623 ****
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_cost;
! Cost run_cost = 0;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
path->rows = tuples;
--- 1614,1641 ----
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_startup_cost;
! Cost run_cost = 0,
! rest_cost,
! group_cost,
! input_run_cost = input_total_cost - input_startup_cost;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
+ double num_groups,
+ group_input_bytes,
+ group_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
+ if (!enable_incrementalsort)
+ presorted_keys = 0;
path->rows = tuples;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1643,1655 ****
output_bytes = input_bytes;
}
! if (output_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(input_bytes / BLCKSZ);
! double nruns = input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
--- 1661,1710 ----
output_bytes = input_bytes;
}
! /*
! * Estimate the number of groups the dataset is divided into by the presorted keys.
! */
! if (presorted_keys > 0)
! {
! List *presortedExprs = NIL;
! ListCell *l;
! int i = 0;
!
! /* Extract presorted keys as list of expressions */
! foreach(l, pathkeys)
! {
! PathKey *key = (PathKey *)lfirst(l);
! EquivalenceMember *member = (EquivalenceMember *)
! lfirst(list_head(key->pk_eclass->ec_members));
!
! presortedExprs = lappend(presortedExprs, member->em_expr);
!
! i++;
! if (i >= presorted_keys)
! break;
! }
!
! /* Estimate number of groups with equal presorted keys */
! num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! }
! else
! {
! num_groups = 1.0;
! }
!
! /*
! * Estimate the average cost of sorting one group in which the presorted
! * keys are equal.
! */
! group_input_bytes = input_bytes / num_groups;
! group_tuples = tuples / num_groups;
! if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(group_input_bytes / BLCKSZ);
! double nruns = group_input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1659,1665 ****
*
* Assume about N log2 N comparisons
*/
! startup_cost += comparison_cost * tuples * LOG2(tuples);
/* Disk costs */
--- 1714,1720 ----
*
* Assume about N log2 N comparisons
*/
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
/* Disk costs */
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1670,1679 ****
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! startup_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1725,1734 ----
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! group_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1681,1694 ****
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
}
else
{
/* We'll use plain quicksort on all the input tuples */
! startup_cost += comparison_cost * tuples * LOG2(tuples);
}
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
--- 1736,1761 ----
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
}
else
{
/* We'll use plain quicksort on all the input tuples */
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
}
+ /* Add the per-group cost of fetching tuples from the input */
+ group_cost += input_run_cost / num_groups;
+
+ /*
+ * We have to sort the first group before the node can start returning
+ * output. Sorting the rest of the groups is only required to return
+ * all the other tuples.
+ */
+ startup_cost += group_cost;
+ rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ if (rest_cost > 0.0)
+ run_cost += rest_cost;
+
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2447,2452 ****
--- 2514,2521 ----
cost_sort(&sort_path,
root,
outersortkeys,
+ pathkeys_common(outer_path->pathkeys, outersortkeys),
+ outer_path->startup_cost,
outer_path->total_cost,
outer_path_rows,
outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2473,2478 ****
--- 2542,2549 ----
cost_sort(&sort_path,
root,
innersortkeys,
+ pathkeys_common(inner_path->pathkeys, innersortkeys),
+ inner_path->startup_cost,
inner_path->total_cost,
inner_path_rows,
inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 2c26906..2da6f40
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
#include "nodes/nodeFuncs.h"
#include "nodes/plannodes.h"
#include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/tlist.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
return PATHKEYS_EQUAL;
}
+
+ /*
+ * pathkeys_common
+ * Return the length of the longest common prefix of keys1 and keys2.
+ */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ int n;
+ ListCell *key1,
+ *key2;
+ n = 0;
+
+ forboth(key1, keys1, key2, keys2)
+ {
+ PathKey *pathkey1 = (PathKey *) lfirst(key1);
+ PathKey *pathkey2 = (PathKey *) lfirst(key2);
+
+ if (pathkey1 != pathkey2)
+ return n;
+ n++;
+ }
+
+ return n;
+ }
+
+
/*
* pathkeys_contained_in
* Common special case of compare_pathkeys: we just want to know
*************** get_cheapest_path_for_pathkeys(List *pat
*** 373,380 ****
/*
* get_cheapest_fractional_path_for_pathkeys
* Find the cheapest path (for retrieving a specified fraction of all
! * the tuples) that satisfies the given pathkeys and parameterization.
! * Return NULL if no such path.
*
* See compare_fractional_path_costs() for the interpretation of the fraction
* parameter.
--- 402,413 ----
/*
* get_cheapest_fractional_path_for_pathkeys
* Find the cheapest path (for retrieving a specified fraction of all
! * the tuples) that satisfies the given parameterization and at least
! * partially satisfies the given pathkeys. Return NULL if no such path.
! * If the pathkeys are only partially satisfied, we will have to perform
! * an incremental sort to satisfy them completely. Since incremental
! * sort consumes data in presorted groups, we will then have to consume
! * more input data than with a fully presorted path.
*
* See compare_fractional_path_costs() for the interpretation of the fraction
* parameter.
*************** right_merge_direction(PlannerInfo *root,
*** 1488,1513 ****
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
! * no good to order by just the first key(s) of the requested ordering.
! * So the result is always either 0 or list_length(root->query_pathkeys).
*/
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
{
! if (root->query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
{
! /* It's useful ... or at least the first N keys are */
! return list_length(root->query_pathkeys);
}
-
- return 0; /* path ordering not useful */
}
/*
--- 1521,1562 ----
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Returns the number of pathkeys that match the given argument; the
! * others can be satisfied by an incremental sort.
*/
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
{
! int n_common_pathkeys;
!
! if (query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
!
! if (enable_incrementalsort)
{
! /*
! * Return the number of pathkeys in common, or 0 if there are none.
! * Any leading common pathkeys are useful for ordering because we
! * can use an incremental sort for the rest.
! */
! return n_common_pathkeys;
! }
! else
! {
! /*
! * When incremental sort is disabled, the pathkeys are useful only
! * when they contain all the query pathkeys.
! */
! if (n_common_pathkeys == list_length(query_pathkeys))
! return n_common_pathkeys;
! else
! return 0;
}
}
/*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1523,1529 ****
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
--- 1572,1578 ----
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 89e1946..f80740e
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 232,238 ****
bool *mergenullsfirst,
Plan *lefttree, Plan *righttree,
JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 232,238 ----
bool *mergenullsfirst,
Plan *lefttree, Plan *righttree,
JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static Plan *prepare_sort_from_pathkeys(
*** 247,256 ****
static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 247,258 ----
static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! int skipCols);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 431,436 ****
--- 433,439 ----
(GatherPath *) best_path);
break;
case T_Sort:
+ case T_IncrementalSort:
plan = (Plan *) create_sort_plan(root,
(SortPath *) best_path,
flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1087,1092 ****
--- 1090,1096 ----
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int n_common_pathkeys;
/* Build the child plan */
/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1121,1129 ****
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
--- 1125,1135 ----
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
+ n_common_pathkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1470,1475 ****
--- 1476,1482 ----
Plan *subplan;
List *pathkeys = best_path->path.pathkeys;
List *tlist = build_path_tlist(root, &best_path->path);
+ int n_common_pathkeys;
/* As with Gather, it's best to project away columns in the workers. */
subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1496,1507 ****
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
--- 1503,1518 ----
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
! {
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ n_common_pathkeys,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
+ }
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1609,1614 ****
--- 1620,1626 ----
{
Sort *plan;
Plan *subplan;
+ int n_common_pathkeys;
/*
* We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1618,1624 ****
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
--- 1630,1640 ----
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! best_path->subpath->pathkeys);
!
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! n_common_pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1864,1870 ****
sort_plan = (Plan *)
make_sort_from_groupcols(groupClause,
new_grpColIdx,
! subplan);
agg_plan = (Plan *) make_agg(NIL,
NIL,
--- 1880,1887 ----
sort_plan = (Plan *)
make_sort_from_groupcols(groupClause,
new_grpColIdx,
! subplan,
! 0);
agg_plan = (Plan *) make_agg(NIL,
NIL,
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3742,3749 ****
*/
if (best_path->outersortkeys)
{
! Sort *sort = make_sort_from_pathkeys(outer_plan,
! best_path->outersortkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
--- 3759,3772 ----
*/
if (best_path->outersortkeys)
{
! Sort *sort;
! int n_common_pathkeys;
!
! n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! best_path->jpath.outerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3754,3761 ****
if (best_path->innersortkeys)
{
! Sort *sort = make_sort_from_pathkeys(inner_plan,
! best_path->innersortkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
--- 3777,3790 ----
if (best_path->innersortkeys)
{
! Sort *sort;
! int n_common_pathkeys;
!
! n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! best_path->jpath.innerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4807,4813 ****
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
--- 4836,4843 ----
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL, 0,
! lefttree->startup_cost,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5366,5378 ****
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node = makeNode(Sort);
! Plan *plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
--- 5396,5426 ----
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node;
! Plan *plan;
+ /* Always use regular sort node when enable_incrementalsort = false */
+ if (!enable_incrementalsort)
+ skipCols = 0;
+
+ if (skipCols == 0)
+ {
+ node = makeNode(Sort);
+ }
+ else
+ {
+ IncrementalSort *incrementalSort;
+
+ incrementalSort = makeNode(IncrementalSort);
+ node = &incrementalSort->sort;
+ incrementalSort->skipCols = skipCols;
+ }
+
+ plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass
*** 5704,5710 ****
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys)
{
int numsortkeys;
AttrNumber *sortColIdx;
--- 5752,5758 ----
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, int skipCols)
{
int numsortkeys;
AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree,
*** 5724,5730 ****
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5772,5778 ----
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5767,5773 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5815,5821 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, 0,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5788,5794 ****
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
--- 5836,5843 ----
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5821,5827 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5870,5876 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** is_projection_capable_plan(Plan *plan)
*** 6469,6474 ****
--- 6518,6524 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index c3fbf3c..5fe1235
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
#include "parser/parse_clause.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
#include "utils/syscache.h"
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index 02286d9..b9f8997
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3508,3521 ****
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_partial_path || is_sorted)
{
/* Sort the cheapest partial path, if it isn't already */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3508,3521 ----
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_partial_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest partial path, if it isn't already */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3588,3601 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_path || is_sorted)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3588,3601 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4323,4335 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || is_sorted)
{
! if (!is_sorted)
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
--- 4323,4335 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || n_useful_pathkeys > 0)
{
! if (n_useful_pathkeys < list_length(root->sort_pathkeys))
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid
*** 5458,5465 ****
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL,
! seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
--- 5458,5466 ----
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL, 0,
! seqScanPath->startup_cost, seqScanPath->total_cost,
! rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index 5f3027e..71fb394
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 623,628 ****
--- 623,629 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 6fa6540..2b7f081
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2698,2703 ****
--- 2698,2704 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_Gather:
case T_GatherMerge:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index 1389db1..0972d4b
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 963,969 ****
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 963,970 ----
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, 0,
! sorted_p.startup_cost, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 8ce772d..e280f4b
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
}
/*
! * compare_path_fractional_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
--- 95,101 ----
}
/*
! * compare_fractional_path_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1294,1305 ****
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1294,1306 ----
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ int n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1313,1318 ****
--- 1314,1321 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->parent->tuples,
subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1549,1555 ****
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
--- 1552,1559 ----
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL, 0,
! subpath->startup_cost,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
*************** create_gather_merge_path(PlannerInfo *ro
*** 1641,1646 ****
--- 1645,1651 ----
GatherMergePath *pathnode = makeNode(GatherMergePath);
Cost input_startup_cost = 0;
Cost input_total_cost = 0;
+ int n_common_pathkeys;
Assert(subpath->parallel_safe);
Assert(pathkeys);
*************** create_gather_merge_path(PlannerInfo *ro
*** 1657,1663 ****
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1662,1670 ----
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
!
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_gather_merge_path(PlannerInfo *ro
*** 1671,1676 ****
--- 1678,1685 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2486,2494 ****
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode = makeNode(SortPath);
- pathnode->path.pathtype = T_Sort;
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
--- 2495,2525 ----
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode;
! int n_common_pathkeys;
!
! if (enable_incrementalsort)
! n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! else
! n_common_pathkeys = 0;
!
! if (n_common_pathkeys == 0)
! {
! pathnode = makeNode(SortPath);
! pathnode->path.pathtype = T_Sort;
! }
! else
! {
! IncrementalSortPath *incpathnode;
!
! incpathnode = makeNode(IncrementalSortPath);
! pathnode = &incpathnode->spath;
! pathnode->path.pathtype = T_IncrementalSort;
! incpathnode->skipCols = n_common_pathkeys;
! }
!
! Assert(n_common_pathkeys < list_length(pathkeys));
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2502,2508 ****
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root, pathkeys,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
--- 2533,2541 ----
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root,
! pathkeys, n_common_pathkeys,
! subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2774,2780 ****
break;
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL,
0.0,
subpath->rows,
subpath->pathtarget->width,
--- 2807,2814 ----
break;
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL, 0,
! 0.0,
0.0,
subpath->rows,
subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index e462fbd..fb54f27
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 277,283 ****
qstate->sortOperators,
qstate->sortCollations,
qstate->sortNullsFirsts,
! work_mem, false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
--- 277,283 ----
qstate->sortOperators,
qstate->sortCollations,
qstate->sortNullsFirsts,
! work_mem, false, false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index bb9a544..735bd15
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3522,3527 ****
--- 3522,3563 ----
}
/*
+ * estimate_pathkeys_groups - Estimate the number of groups the dataset is
+ * divided into by the pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * the first i pathkeys divide the dataset into. This is essentially a
+ * convenience wrapper around estimate_num_groups().
+ */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ ListCell *l;
+ List *groupExprs = NIL;
+ double *result;
+ int i;
+
+ /*
+ * Get number of groups for each prefix of pathkeys.
+ */
+ i = 0;
+ result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ groupExprs = lappend(groupExprs, member->em_expr);
+
+ result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ i++;
+ }
+
+ return result;
+ }
+
+ /*
* Estimate hash bucketsize fraction (ie, number of entries in a bucket
* divided by total tuples in relation) if the specified expression is used
* as a hash key.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 4feb26a..d4f5555
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 858,863 ****
--- 858,872 ----
NULL, NULL, NULL
},
{
+ {"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of incremental sort steps."),
+ NULL
+ },
+ &enable_incrementalsort,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of hashed aggregation plans."),
NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index e1e692d..af93ae4
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 281,286 ****
--- 281,291 ----
int64 allowedMem; /* total memory allowed, in bytes */
int maxTapes; /* number of tapes (Knuth's T) */
int tapeRange; /* maxTapes-1 (Knuth's P) */
+ TupSortStatus maxStatus; /* maximum status reached between sort groups */
+ int64 maxMem; /* maximum amount of memory used between
+ sort groups */
+ bool maxMemOnDisk; /* is maxMem value for on-disk memory */
+ MemoryContext maincontext;
MemoryContext sortcontext; /* memory context holding most sort data */
MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 633,638 ****
--- 638,646 ----
static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+
/*
* Special versions of qsort just for SortTuple objects. qsort_tuple() sorts
*************** static Tuplesortstate *
*** 667,685 ****
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Create a working memory context for this sort operation. All data
! * needed by the sort will live inside this context.
*/
! sortcontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
--- 675,704 ----
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
+ MemoryContext maincontext;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Memory context surviving tuplesort_reset. This memory context holds
! * data that is worth keeping across multiple similar sort batches.
*/
! maincontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
+ * Create a working memory context for one sort operation. The content of
+ * this context is deleted by tuplesort_reset.
+ */
+ sortcontext = AllocSetContextCreate(maincontext,
+ "TupleSort sort",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+
+ /*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 696,702 ****
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(sortcontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
--- 715,721 ----
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(maincontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
*************** tuplesort_begin_common(int workMem, bool
*** 714,719 ****
--- 733,739 ----
state->availMem = state->allowedMem;
state->sortcontext = sortcontext;
state->tuplecontext = tuplecontext;
+ state->maincontext = maincontext;
state->tapeset = NULL;
state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 754,766 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
AssertArg(nkeys > 0);
--- 774,787 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
AssertArg(nkeys > 0);
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 802,808 ****
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0);
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
--- 823,829 ----
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0) && !skipAbbrev;
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 833,839 ****
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 854,860 ----
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 924,930 ****
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 945,951 ----
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 997,1003 ****
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 1018,1024 ----
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 1034,1040 ****
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 1055,1061 ----
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1145,1160 ****
}
/*
! * tuplesort_end
! *
! * Release resources and clean up.
*
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
*/
! void
! tuplesort_end(Tuplesortstate *state)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1166,1177 ----
}
/*
! * tuplesort_free
*
! * Internal routine for freeing resources of tuplesort.
*/
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1213,1219 ****
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! MemoryContextDelete(state->sortcontext);
}
/*
--- 1230,1327 ----
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! if (delete)
! {
! MemoryContextDelete(state->maincontext);
! }
! else
! {
! MemoryContextResetOnly(state->sortcontext);
! MemoryContextResetOnly(state->tuplecontext);
! }
! }
!
! /*
! * tuplesort_end
! *
! * Release resources and clean up.
! *
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
! */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! tuplesort_free(state, true);
! }
!
! /*
! * tuplesort_updatemax
! *
! * Update maximum resource usage statistics.
! */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! int64 memUsed;
! bool memUsedOnDisk;
!
! /*
! * Note: it might seem we should provide both memory and disk usage for a
! * disk-based sort. However, the current code doesn't track memory space
! * accurately once we have begun to return tuples to the caller (since we
! * don't account for pfree's the caller is expected to do), so we cannot
! * rely on availMem in a disk sort. This does not seem worth the overhead
! * to fix. Is it worth creating an API for the memory context code to
! * tell us how much is actually used in sortcontext?
! */
! if (state->tapeset)
! {
! memUsedOnDisk = true;
! memUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! }
! else
! {
! memUsedOnDisk = false;
! memUsed = state->allowedMem - state->availMem;
! }
!
! state->maxStatus = Max(state->maxStatus, state->status);
! if (memUsed > state->maxMem)
! {
! state->maxMem = memUsed;
! state->maxMemOnDisk = memUsedOnDisk;
! }
! }
!
! /*
! * tuplesort_reset
! *
! * Reset the tuplesort. Reset all the data in the tuplesort, but leave the
! * meta-information in. After tuplesort_reset, the tuplesort is ready to
! * start a new sort. This avoids recreating the tuplesort (and saves
! * resources) when sorting multiple small batches.
! */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! tuplesort_updatemax(state);
! tuplesort_free(state, false);
! state->status = TSS_INITIAL;
! state->memtupcount = 0;
! state->boundUsed = false;
! state->tapeset = NULL;
! state->currentRun = 0;
! state->result_tape = -1;
! state->bounded = false;
! state->availMem = state->allowedMem;
! state->lastReturnedTuple = NULL;
! state->slabAllocatorUsed = false;
! state->slabMemoryBegin = NULL;
! state->slabMemoryEnd = NULL;
! state->slabFreeHead = NULL;
! USEMEM(state, GetMemoryChunkSpace(state->memtuples));
}
/*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 3219,3245 ****
const char **spaceType,
long *spaceUsed)
{
! /*
! * Note: it might seem we should provide both memory and disk usage for a
! * disk-based sort. However, the current code doesn't track memory space
! * accurately once we have begun to return tuples to the caller (since we
! * don't account for pfree's the caller is expected to do), so we cannot
! * rely on availMem in a disk sort. This does not seem worth the overhead
! * to fix. Is it worth creating an API for the memory context code to
! * tell us how much is actually used in sortcontext?
! */
! if (state->tapeset)
! {
*spaceType = "Disk";
- *spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- }
else
- {
*spaceType = "Memory";
! *spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! }
! switch (state->status)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
--- 3327,3341 ----
const char **spaceType,
long *spaceUsed)
{
! tuplesort_updatemax(state);
!
! if (state->maxMemOnDisk)
*spaceType = "Disk";
else
*spaceType = "Memory";
! *spaceUsed = (state->maxMem + 1023) / 1024;
! switch (state->maxStatus)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...09c5a27
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,25 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+
+ #include "nodes/execnodes.h"
+
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node,
+ EState *estate, int eflags);
+ extern TupleTableSlot *ExecIncrementalSort(IncrementalSortState *node);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+ #endif /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index f856f60..347b551
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1899,1904 ****
--- 1899,1918 ----
Tuplestorestate *tuplestorestate;
} MaterialState;
+
+ /* ----------------
+ * When sorting by multiple keys, the input dataset may already be
+ * presorted by some prefix of those keys. We call these "skip keys".
+ * SkipKeyData represents information about one such key.
+ * ----------------
+ */
+ typedef struct SkipKeyData
+ {
+ FmgrInfo flinfo; /* comparison function info */
+ FunctionCallInfoData fcinfo; /* comparison function call info */
+ OffsetNumber attno; /* attribute number in tuple */
+ } SkipKeyData;
+
/* ----------------
* SortState information
* ----------------
*************** typedef struct SortState
*** 1915,1920 ****
--- 1929,1954 ----
void *tuplesortstate; /* private state of tuplesort.c */
} SortState;
+ /* ----------------
+ * IncrementalSortState information
+ * ----------------
+ */
+ typedef struct IncrementalSortState
+ {
+ ScanState ss; /* its first field is NodeTag */
+ bool bounded; /* is the result set bounded? */
+ int64 bound; /* if bounded, how many tuples are needed */
+ bool sort_Done; /* sort completed yet? */
+ bool finished; /* fetching tuples from the outer
+ node is finished? */
+ bool bounded_Done; /* value of bounded we did the sort with */
+ int64 bound_Done; /* value of bound we did the sort with */
+ void *tuplesortstate; /* private state of tuplesort.c */
+ SkipKeyData *skipKeys; /* keys the dataset is presorted by */
+ int64 groupsCount; /* number of groups with equal skip keys */
+ TupleTableSlot *prevSlot; /* slot for previous tuple from outer node */
+ } IncrementalSortState;
+
/* ---------------------
* GroupState information
* ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index 2bc7a5d..22b2c46
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 72,77 ****
--- 72,78 ----
T_HashJoin,
T_Material,
T_Sort,
+ T_IncrementalSort,
T_Group,
T_Agg,
T_WindowAgg,
*************** typedef enum NodeTag
*** 123,128 ****
--- 124,130 ----
T_HashJoinState,
T_MaterialState,
T_SortState,
+ T_IncrementalSortState,
T_GroupState,
T_AggState,
T_WindowAggState,
*************** typedef enum NodeTag
*** 255,260 ****
--- 257,263 ----
T_ProjectionPath,
T_ProjectSetPath,
T_SortPath,
+ T_IncrementalSortPath,
T_GroupPath,
T_UpperUniquePath,
T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index b880dc1..990585e
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 711,716 ****
--- 711,727 ----
bool *nullsFirst; /* NULLS FIRST/LAST directions */
} Sort;
+
+ /* ----------------
+ * incremental sort node
+ * ----------------
+ */
+ typedef struct IncrementalSort
+ {
+ Sort sort;
+ int skipCols; /* number of presorted columns */
+ } IncrementalSort;
+
/* ---------------
* group node -
* Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index 05d6f07..b386697
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1344,1349 ****
--- 1344,1359 ----
} SortPath;
/*
+ * IncrementalSortPath
+ */
+ typedef struct IncrementalSortPath
+ {
+ SortPath spath;
+ int skipCols;
+ } IncrementalSortPath;
+
+
+ /*
* GroupPath represents grouping (of presorted input)
*
* groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index d9a9b12..06827e3
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
extern bool enable_bitmapscan;
extern bool enable_tidscan;
extern bool enable_sort;
+ extern bool enable_incrementalsort;
extern bool enable_hashagg;
extern bool enable_nestloop;
extern bool enable_material;
*************** extern void cost_ctescan(Path *path, Pla
*** 100,107 ****
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
--- 101,109 ----
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 25fe78c..01073dd
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 182,187 ****
--- 182,188 ----
extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
Relids required_outer,
CostSelector cost_criterion,
*************** extern List *select_outer_pathkeys_for_m
*** 220,225 ****
--- 221,227 ----
extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
List *mergeclauses,
List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
extern List *truncate_useless_pathkeys(PlannerInfo *root,
RelOptInfo *rel,
List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 9f9d2dc..b8884b6
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 204,209 ****
--- 204,212 ----
extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows, List **pgset);
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ double tuples);
+
extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
double nbuckets);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 5b3f475..616f9f5
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 62,68 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
--- 62,69 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 104,109 ****
--- 105,112 ----
extern void tuplesort_end(Tuplesortstate *state);
+ extern void tuplesort_reset(Tuplesortstate *state);
+
extern void tuplesort_get_stats(Tuplesortstate *state,
const char **sortMethod,
const char **spaceType,
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort
*** 19,27 ****
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Sort
Sort Key: id, data
! -> Seq Scan on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
--- 19,28 ----
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Incremental Sort
Sort Key: id, data
! Presorted Key: id
! -> Index Scan using test_dc_pkey on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
new file mode 100644
index 0ff8062..3ad5eb3
*** a/src/test/regress/expected/aggregates.out
--- b/src/test/regress/expected/aggregates.out
*************** group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.y,t
*** 996,1010 ****
explain (costs off) select t1.*,t2.x,t2.z
from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
! QUERY PLAN
! -------------------------------------------------------
! HashAggregate
Group Key: t1.a, t1.b, t2.x, t2.z
! -> Merge Join
! Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
! -> Index Scan using t1_pkey on t1
! -> Index Scan using t2_pkey on t2
! (6 rows)
-- Cannot optimize when PK is deferrable
explain (costs off) select * from t3 group by a,b,c;
--- 996,1013 ----
explain (costs off) select t1.*,t2.x,t2.z
from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
! QUERY PLAN
! -------------------------------------------------------------
! Group
Group Key: t1.a, t1.b, t2.x, t2.z
! -> Incremental Sort
! Sort Key: t1.a, t1.b, t2.z
! Presorted Key: t1.a, t1.b
! -> Merge Join
! Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
! -> Index Scan using t1_pkey on t1
! -> Index Scan using t2_pkey on t2
! (9 rows)
-- Cannot optimize when PK is deferrable
explain (costs off) select * from t3 group by a,b,c;
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index 6494b20..c3e2609
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE: drop cascades to table matest1
*** 1454,1459 ****
--- 1454,1460 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1594,1602 ****
--- 1595,1639 ----
{3,7,8,10,13,13,16,18,19,22}
(3 rows)
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ QUERY PLAN
+ -------------------------------------------------------------------------
+ Merge Append
+ Sort Key: tenk1.thousand, tenk1.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1
+ -> Incremental Sort
+ Sort Key: tenk1_1.thousand, tenk1_1.thousand
+ Presorted Key: tenk1_1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ QUERY PLAN
+ -------------------------------------------------------------
+ Merge Append
+ Sort Key: a.thousand, a.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 a
+ -> Incremental Sort
+ Sort Key: b.unique2, b.unique2
+ Presorted Key: b.unique2
+ -> Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index 568b783..e60fb43
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select count(*) >= 0 as ok from pg_prepa
*** 70,90 ****
-- This is to record the prevailing planner enable_foo settings during
-- a regression test run.
select name, setting from pg_settings where name like 'enable%';
! name | setting
! ----------------------+---------
! enable_bitmapscan | on
! enable_gathermerge | on
! enable_hashagg | on
! enable_hashjoin | on
! enable_indexonlyscan | on
! enable_indexscan | on
! enable_material | on
! enable_mergejoin | on
! enable_nestloop | on
! enable_seqscan | on
! enable_sort | on
! enable_tidscan | on
! (12 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
--- 70,91 ----
-- This is to record the prevailing planner enable_foo settings during
-- a regression test run.
select name, setting from pg_settings where name like 'enable%';
! name | setting
! ------------------------+---------
! enable_bitmapscan | on
! enable_gathermerge | on
! enable_hashagg | on
! enable_hashjoin | on
! enable_incrementalsort | on
! enable_indexonlyscan | on
! enable_indexscan | on
! enable_material | on
! enable_mergejoin | on
! enable_nestloop | on
! enable_seqscan | on
! enable_sort | on
! enable_tidscan | on
! (13 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index e3e9e34..0bf3c01
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 499,504 ****
--- 499,505 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
*************** SELECT
*** 560,568 ****
--- 561,586 ----
ORDER BY f.i LIMIT 10)
FROM generate_series(1, 3) g(i);
+ set enable_incrementalsort = on;
+
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
On 03/20/2017 11:33 AM, Alexander Korotkov wrote:
Please, find rebased patch in the attachment.
I had a quick look at this.
* I'd love to have an explanation of what an Incremental Sort is, in the
file header comment for nodeIncrementalSort.c.
* I didn't understand the maxMem stuff in tuplesort.c. The comments
there use the phrase "on-disk memory", which seems like an oxymoron.
Also, "maximum status" seems weird, as it assumes that there's a natural
order to the states.
* In the below example, the incremental sort is significantly slower
than the Seq Scan + Sort you get otherwise:
create table sorttest (a int4, b int4, c int4);
insert into sorttest select g, g, g from generate_series(1, 1000000) g;
vacuum sorttest;
create index i_sorttest on sorttest (a, b, c);
set work_mem='100MB';
postgres=# explain select count(*) from (select * from sorttest order by
a, c) as t;
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Aggregate (cost=138655.68..138655.69 rows=1 width=8)
-> Incremental Sort (cost=610.99..124870.38 rows=1102824 width=12)
Sort Key: sorttest.a, sorttest.c
Presorted Key: sorttest.a
-> Index Only Scan using i_sorttest on sorttest
(cost=0.43..53578.79 rows=1102824 width=12)
(5 rows)
Time: 0.409 ms
postgres=# select count(*) from (select * from sorttest order by a, c) as t;
count
---------
1000000
(1 row)
Time: 387.091 ms
postgres=# explain select count(*) from (select * from sorttest order by
a, c) as t;
QUERY PLAN
-------------------------------------------------------------------------------
Aggregate (cost=130063.84..130063.85 rows=1 width=8)
-> Sort (cost=115063.84..117563.84 rows=1000000 width=12)
Sort Key: sorttest.a, sorttest.c
-> Seq Scan on sorttest (cost=0.00..15406.00 rows=1000000
width=12)
(4 rows)
Time: 0.345 ms
postgres=# select count(*) from (select * from sorttest order by a, c) as t;
count
---------
1000000
(1 row)
Time: 231.668 ms
According to 'perf', 85% of the CPU time is spent in ExecCopySlot(). To
alleviate that, it might be worthwhile to add a special case for when
the group contains exactly one tuple, and not put the tuple into the
tuplesort in that case. Or if we cannot ensure that the Incremental Sort
is actually faster, the cost model should probably be smarter, to avoid
picking an incremental sort when it's not a win.
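The loop being discussed can be sketched in Python (an illustration only, not the patch's C code): the input is already sorted on a prefix of the sort key, so each prefix group is buffered and sorted on its own, and the suggested special case skips the per-group sort when a group holds exactly one tuple.

```python
def incremental_sort(rows, prefix_len, key_len):
    """Yield rows sorted on row[:key_len], assuming the input iterable
    is already sorted on row[:prefix_len] (the "presorted" prefix)."""
    group = []
    prev_prefix = None
    for row in rows:
        prefix = row[:prefix_len]
        if group and prefix != prev_prefix:
            # Prefix changed: flush the previous group. A one-tuple
            # group is already sorted, so skip the sort for it.
            if len(group) > 1:
                group.sort(key=lambda r: r[:key_len])
            yield from group
            group = []
        group.append(row)
        prev_prefix = prefix
    # Flush the final group.
    if len(group) > 1:
        group.sort(key=lambda r: r[:key_len])
    yield from group
```

Only one group is held in memory at a time, which is where the memory and startup-cost advantage over a full sort comes from.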
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi Alexander,
On 3/20/17 10:19 AM, Heikki Linnakangas wrote:
On 03/20/2017 11:33 AM, Alexander Korotkov wrote:
Please, find rebased patch in the attachment.
I had a quick look at this.
<...>
According to 'perf', 85% of the CPU time is spent in ExecCopySlot(). To
alleviate that, it might be worthwhile to add a special case for when
the group contains exactly one tuple, and not put the tuple into the
tuplesort in that case. Or if we cannot ensure that the Incremental Sort
is actually faster, the cost model should probably be smarter, to avoid
picking an incremental sort when it's not a win.
This thread has been idle for over a week. Please respond with a new
patch by 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked
"Returned with Feedback".
--
-David
david@pgmasters.net
On Tue, Mar 28, 2017 at 5:27 PM, David Steele <david@pgmasters.net> wrote:
Hi Alexander,
On 3/20/17 10:19 AM, Heikki Linnakangas wrote:
On 03/20/2017 11:33 AM, Alexander Korotkov wrote:
Please, find rebased patch in the attachment.
I had a quick look at this.
<...>
According to 'perf', 85% of the CPU time is spent in ExecCopySlot(). To
alleviate that, it might be worthwhile to add a special case for when
the group contains exactly one tuple, and not put the tuple into the
tuplesort in that case. Or if we cannot ensure that the Incremental Sort
is actually faster, the cost model should probably be smarter, to avoid
picking an incremental sort when it's not a win.

This thread has been idle for over a week. Please respond with a new
patch by 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked
"Returned with Feedback".
Thank you for reminder!
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Mar 20, 2017 at 5:19 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 03/20/2017 11:33 AM, Alexander Korotkov wrote:
Please, find rebased patch in the attachment.
I had a quick look at this.
* I'd love to have an explanation of what an Incremental Sort is, in the
file header comment for nodeIncrementalSort.c.
Done.
* I didn't understand the maxMem stuff in tuplesort.c. The comments there
use the phrase "on-disk memory", which seems like an oxymoron. Also,
"maximum status" seems weird, as it assumes that there's a natural order to
the states.
Variables were renamed.
* In the below example, the incremental sort is significantly slower than
the Seq Scan + Sort you get otherwise:
create table sorttest (a int4, b int4, c int4);
insert into sorttest select g, g, g from generate_series(1, 1000000) g;
vacuum sorttest;
create index i_sorttest on sorttest (a, b, c);
set work_mem='100MB';

postgres=# explain select count(*) from (select * from sorttest order by
a, c) as t;
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Aggregate (cost=138655.68..138655.69 rows=1 width=8)
-> Incremental Sort (cost=610.99..124870.38 rows=1102824 width=12)
Sort Key: sorttest.a, sorttest.c
Presorted Key: sorttest.a
-> Index Only Scan using i_sorttest on sorttest
(cost=0.43..53578.79 rows=1102824 width=12)
(5 rows)

Time: 0.409 ms
postgres=# select count(*) from (select * from sorttest order by a, c) as
t;
count
---------
1000000
(1 row)

Time: 387.091 ms
postgres=# explain select count(*) from (select * from sorttest order by
a, c) as t;
QUERY PLAN
-------------------------------------------------------------------------------
Aggregate (cost=130063.84..130063.85 rows=1 width=8)
-> Sort (cost=115063.84..117563.84 rows=1000000 width=12)
Sort Key: sorttest.a, sorttest.c
-> Seq Scan on sorttest (cost=0.00..15406.00 rows=1000000
width=12)
(4 rows)

Time: 0.345 ms
postgres=# select count(*) from (select * from sorttest order by a, c) as
t;
count
---------
1000000
(1 row)

Time: 231.668 ms
According to 'perf', 85% of the CPU time is spent in ExecCopySlot(). To
alleviate that, it might be worthwhile to add a special case for when the
group contains exactly one tuple, and not put the tuple into the tuplesort in
that case.
I'm not sure we should add such an optimization just for one tuple per group,
since the situation is similar with 2 or 3 tuples per group.
Or if we cannot ensure that the Incremental Sort is actually faster, the
cost model should probably be smarter, to avoid picking an incremental sort
when it's not a win.
I added extra costing for incremental sort to cost_sort(): the cost of extra
tuple copying and comparison, as well as the cost of tuplesort resets.
The only problem is that I made the following estimate for the tuplesort reset:
run_cost += 10.0 * cpu_tuple_cost * num_groups;
It makes the ordinary sort be selected in your example, but it contains the
constant 10, which is quite arbitrary. It would be nice to avoid such
hard-coded constants, but I don't know how we could calculate such a cost
realistically.
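For illustration, here is a back-of-envelope Python sketch (not planner code) of the extra cost terms described above. The cost-constant values are PostgreSQL's defaults for cpu_tuple_cost and cpu_operator_cost; the exact per-tuple copy/compare formula is my own simplification, and the 10.0 multiplier is the arbitrary constant under discussion:

```python
# PostgreSQL default planner cost constants.
CPU_TUPLE_COST = 0.01
CPU_OPERATOR_COST = 0.0025

def incsort_extra_cost(tuples, num_groups):
    """Rough sketch of the extra run cost charged to an incremental sort:
    a per-group tuplesort reset penalty (with the hard-coded 10.0 factor)
    plus a simplified per-tuple copy-and-compare term."""
    reset_cost = 10.0 * CPU_TUPLE_COST * num_groups
    copy_compare_cost = tuples * (CPU_TUPLE_COST + CPU_OPERATOR_COST)
    return reset_cost + copy_compare_cost
```

With one tuple per group (as in the benchmark above), the reset term alone dominates, which is what pushes the planner back toward the plain sort.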
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
incremental-sort-4.patch (application/octet-stream)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index a466bf2..1cabe3f
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1913,1951 ****
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ---------------------------------------------------------------------
Limit
! Output: t1.c1, t2.c1
-> Sort
! Output: t1.c1, t2.c1
! Sort Key: t1.c1, t2.c1
-> Nested Loop
! Output: t1.c1, t2.c1
-> Foreign Scan on public.ft1 t1
! Output: t1.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-> Materialize
! Output: t2.c1
-> Foreign Scan on public.ft2 t2
! Output: t2.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! c1 | c1
! ----+-----
! 1 | 101
! 1 | 102
! 1 | 103
! 1 | 104
! 1 | 105
! 1 | 106
! 1 | 107
! 1 | 108
! 1 | 109
! 1 | 110
(10 rows)
-- different server, not pushed down. No result expected.
--- 1913,1951 ----
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ------------------------------------------------------------------
Limit
! Output: t1.c3, t2.c3
-> Sort
! Output: t1.c3, t2.c3
! Sort Key: t1.c3, t2.c3
-> Nested Loop
! Output: t1.c3, t2.c3
-> Foreign Scan on public.ft1 t1
! Output: t1.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
-> Materialize
! Output: t2.c3
-> Foreign Scan on public.ft2 t2
! Output: t2.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! c3 | c3
! -------+-------
! 00001 | 00101
! 00001 | 00102
! 00001 | 00103
! 00001 | 00104
! 00001 | 00105
! 00001 | 00106
! 00001 | 00107
! 00001 | 00108
! 00001 | 00109
! 00001 | 00110
(10 rows)
-- different server, not pushed down. No result expected.
*************** select c2/2, sum(c2) * (c2/2) from ft1 g
*** 2487,2504 ****
-- Aggregates in subquery are pushed down.
explain (verbose, costs off)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
! QUERY PLAN
! ----------------------------------------------------------------------------------------------------------
Aggregate
Output: count(ft1.c2), sum(ft1.c2)
! -> Sort
Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
Sort Key: ft1.c2, (sum(ft1.c1))
! -> Foreign Scan
! Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
! Relations: Aggregate on (public.ft1)
! Remote SQL: SELECT c2, sum("C 1"), sqrt("C 1") FROM "S 1"."T 1" GROUP BY c2, (sqrt("C 1"))
! (9 rows)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
count | sum
--- 2487,2507 ----
-- Aggregates in subquery are pushed down.
explain (verbose, costs off)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
! QUERY PLAN
! --------------------------------------------------------------------------------------------------------------------------
Aggregate
Output: count(ft1.c2), sum(ft1.c2)
! -> Incremental Sort
Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
Sort Key: ft1.c2, (sum(ft1.c1))
! Presorted Key: ft1.c2
! -> GroupAggregate
! Output: ft1.c2, sum(ft1.c1), (sqrt((ft1.c1)::double precision))
! Group Key: ft1.c2, sqrt((ft1.c1)::double precision)
! -> Foreign Scan on public.ft1
! Output: ft1.c2, sqrt((ft1.c1)::double precision), ft1.c1
! Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" ORDER BY c2 ASC NULLS LAST, sqrt("C 1") ASC NULLS LAST
! (12 rows)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
count | sum
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index 8f3edc1..a13d556
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 479,486 ****
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 479,486 ----
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index ac339fb..59763ab
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3563,3568 ****
--- 3563,3582 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+ <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of incremental sort
+ steps. The default is <literal>on</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
<term><varname>enable_indexscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index ea19ba6..08222bc
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual,
*** 79,84 ****
--- 79,86 ----
ExplainState *es);
static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 92,98 ****
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 94,100 ----
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 100,105 ****
--- 102,109 ----
static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 993,998 ****
--- 997,1005 ----
case T_Sort:
pname = sname = "Sort";
break;
+ case T_IncrementalSort:
+ pname = sname = "Incremental Sort";
+ break;
case T_Group:
pname = sname = "Group";
break;
*************** ExplainNode(PlanState *planstate, List *
*** 1565,1570 ****
--- 1572,1583 ----
show_sort_keys(castNode(SortState, planstate), ancestors, es);
show_sort_info(castNode(SortState, planstate), es);
break;
+ case T_IncrementalSort:
+ show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ ancestors, es);
+ show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ es);
+ break;
case T_MergeAppend:
show_merge_append_keys(castNode(MergeAppendState, planstate),
ancestors, es);
*************** static void
*** 1890,1904 ****
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
* Likewise, for a MergeAppend node.
*/
static void
--- 1903,1940 ----
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
+ int skipCols;
+
+ if (IsA(plan, IncrementalSort))
+ skipCols = ((IncrementalSort *) plan)->skipCols;
+ else
+ skipCols = 0;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, skipCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es)
+ {
+ IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+ show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ plan->sort.numCols, plan->skipCols,
+ plan->sort.sortColIdx,
+ plan->sort.sortOperators, plan->sort.collations,
+ plan->sort.nullsFirst,
+ ancestors, es);
+ }
+
+ /*
* Likewise, for a MergeAppend node.
*/
static void
*************** show_merge_append_keys(MergeAppendState
*** 1908,1914 ****
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
--- 1944,1950 ----
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, 0, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1932,1938 ****
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
--- 1968,1974 ----
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
*************** show_grouping_set_keys(PlanState *planst
*** 2001,2007 ****
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
--- 2037,2043 ----
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, 0, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 2058,2064 ****
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
--- 2094,2100 ----
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2071,2083 ****
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
--- 2107,2120 ----
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
+ List *resultPresorted = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2117,2125 ****
--- 2154,2166 ----
nullsFirst[keyno]);
/* Emit one property-list item per sort key */
result = lappend(result, pstrdup(sortkeybuf.data));
+ if (keyno < nPresortedKeys)
+ resultPresorted = lappend(resultPresorted, exprstr);
}
ExplainPropertyList(qlabel, result, es);
+ if (nPresortedKeys > 0)
+ ExplainPropertyList("Presorted Key", resultPresorted, es);
}
/*
*************** show_sort_info(SortState *sortstate, Exp
*** 2277,2282 ****
--- 2318,2360 ----
}
/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es)
+ {
+ if (es->analyze && incrsortstate->sort_Done &&
+ incrsortstate->tuplesortstate != NULL)
+ {
+ Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+
+ tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",
+ sortMethod, spaceType, spaceUsed);
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Groups: %ld\n",
+ incrsortstate->groupsCount);
+ }
+ else
+ {
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups",
+ incrsortstate->groupsCount, es);
+ }
+ }
+ }
+
+ /*
* Show information on hash buckets/batches.
*/
static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index d1c1324..5332e83
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execExpr.
*** 24,31 ****
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
! nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
nodeTableFuncscan.o
--- 24,32 ----
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
! nodeSort.o nodeIncrementalSort.o \
! nodeUnique.o nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
nodeTableFuncscan.o
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index 5d59f95..e04175a
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 30,35 ****
--- 30,36 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 243,248 ****
--- 244,253 ----
ExecReScanSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecReScanIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecReScanGroup((GroupState *) node);
break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 514,521 ****
--- 519,530 ----
case T_CteScan:
case T_Material:
case T_Sort:
+ /* these don't evaluate tlist */
return true;
+ case T_IncrementalSort:
+ return false;
+
case T_LockRows:
case T_Limit:
return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index 80c77ad..1fa1de4
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 93,98 ****
--- 93,99 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 302,307 ****
--- 303,313 ----
estate, eflags);
break;
+ case T_IncrementalSort:
+ result = (PlanState *) ExecInitIncrementalSort(
+ (IncrementalSort *) node, estate, eflags);
+ break;
+
case T_Group:
result = (PlanState *) ExecInitGroup((Group *) node,
estate, eflags);
*************** ExecProcNode(PlanState *node)
*** 521,526 ****
--- 527,536 ----
result = ExecSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ result = ExecIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
result = ExecGroup((GroupState *) node);
break;
*************** ExecEndNode(PlanState *node)
*** 789,794 ****
--- 799,808 ----
ExecEndSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecEndIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecEndGroup((GroupState *) node);
break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index ef35da6..afb5cb2
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 655,660 ****
--- 655,661 ----
sortnode->collations,
sortnode->nullsFirst,
work_mem,
+ false,
false);
}
*************** initialize_aggregate(AggState *aggstate,
*** 733,739 ****
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false);
}
/*
--- 734,740 ----
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false, false);
}
/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...5aa2c62
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,527 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ * Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ * Incremental sort is a specially optimized kind of multikey sort used
+ * when the input is already presorted by a prefix of the required list
+ * of keys. Thus, when it's required to sort by (key1, key2 ... keyN) and
+ * the input is already sorted by (key1, key2 ... keyM), M < N, we sort
+ * each group of tuples where the values of (key1, key2 ... keyM) are equal.
+ *
+ * Consider the following example. We have input tuples consisting of
+ * two integers (x, y), already presorted by x, while it's required to
+ * sort them by both x and y. Let the input tuples be as follows.
+ *
+ * (1, 5)
+ * (1, 2)
+ * (2, 10)
+ * (2, 1)
+ * (2, 5)
+ * (3, 3)
+ * (3, 7)
+ *
+ * The incremental sort algorithm would sort the following groups, which
+ * have equal x, individually:
+ * (1, 5) (1, 2)
+ * (2, 10) (2, 1) (2, 5)
+ * (3, 3) (3, 7)
+ *
+ * After sorting these groups and concatenating them, we get the
+ * following tuple set, which is sorted by both x and y.
+ *
+ * (1, 2)
+ * (1, 5)
+ * (2, 1)
+ * (2, 5)
+ * (2, 10)
+ * (3, 3)
+ * (3, 7)
+ *
+ * Incremental sort is faster than a full sort on large datasets. But
+ * it provides the biggest benefit for queries with LIMIT, because
+ * incremental sort can return the first tuples without reading the
+ * whole input dataset.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+
+ /*
+ * Check whether the first "skipCols" sort-column values of two tuples are equal.
+ */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ TupleTableSlot *b)
+ {
+ int n, i;
+
+ Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+ n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+ for (i = 0; i < n; i++)
+ {
+ Datum datumA, datumB, result;
+ bool isnullA, isnullB;
+ AttrNumber attno = node->skipKeys[i].attno;
+ SkipKeyData *key;
+
+ datumA = slot_getattr(a, attno, &isnullA);
+ datumB = slot_getattr(b, attno, &isnullB);
+
+ /* Special case for NULL-vs-NULL, else use standard comparison */
+ if (isnullA || isnullB)
+ {
+ if (isnullA == isnullB)
+ continue;
+ else
+ return false;
+ }
+
+ key = &node->skipKeys[i];
+
+ key->fcinfo.arg[0] = datumA;
+ key->fcinfo.arg[1] = datumB;
+
+ /* just for paranoia's sake, we reset isnull each time */
+ key->fcinfo.isnull = false;
+
+ result = FunctionCallInvoke(&key->fcinfo);
+
+ /* Check for null result, since caller is clearly not expecting one */
+ if (key->fcinfo.isnull)
+ elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+ if (!DatumGetBool(result))
+ return false;
+ }
+ return true;
+ }
+
+ /*
+ * Prepare information for skipKeys comparison.
+ */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ int skipCols,
+ i;
+
+ Assert(IsA(plannode, IncrementalSort));
+ skipCols = plannode->skipCols;
+
+ node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+ for (i = 0; i < skipCols; i++)
+ {
+ Oid equalityOp, equalityFunc;
+ SkipKeyData *key;
+
+ key = &node->skipKeys[i];
+ key->attno = plannode->sort.sortColIdx[i];
+
+ equalityOp = get_equality_op_for_ordering_op(
+ plannode->sort.sortOperators[i], NULL);
+ if (!OidIsValid(equalityOp))
+ elog(ERROR, "missing equality operator for ordering operator %u",
+ plannode->sort.sortOperators[i]);
+
+ equalityFunc = get_opcode(equalityOp);
+ if (!OidIsValid(equalityFunc))
+ elog(ERROR, "missing function for operator %u", equalityOp);
+
+ /* Lookup the comparison function */
+ fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+ /* We can initialize the callinfo just once and re-use it */
+ InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ plannode->sort.collations[i], NULL, NULL);
+ key->fcinfo.argnull[0] = false;
+ key->fcinfo.argnull[1] = false;
+ }
+ }
+
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSort
+ *
+ * Assuming that the outer subtree returns tuples presorted by some
+ * prefix of the target sort columns, performs an incremental sort. It
+ * fetches groups of tuples where the prefix sort columns are equal and
+ * sorts them using tuplesort. This approach avoids sorting the whole
+ * dataset at once. Besides taking less memory and being faster, it
+ * allows tuples to be returned before the full dataset has been fetched
+ * from the outer subtree.
+ *
+ * Conditions:
+ * -- none.
+ *
+ * Initial States:
+ * -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+ TupleTableSlot *
+ ExecIncrementalSort(IncrementalSortState *node)
+ {
+ EState *estate;
+ ScanDirection dir;
+ Tuplesortstate *tuplesortstate;
+ TupleTableSlot *slot;
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ PlanState *outerNode;
+ int skipCols;
+ TupleDesc tupDesc;
+ int64 nTuples = 0;
+
+ skipCols = plannode->skipCols;
+
+ /*
+ * get state info from node
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "entering routine");
+
+ estate = node->ss.ps.state;
+ dir = estate->es_direction;
+ tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+ /*
+ * Return next tuple from sorted set if any.
+ */
+ if (node->sort_Done)
+ {
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ if (tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ slot, NULL) || node->finished)
+ return slot;
+ }
+
+ /*
+ * Read the next group of tuples from the outer plan and pass them to
+ * tuplesort.c.
+ */
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "sorting subplan");
+
+ /*
+ * Want to scan subplan in the forward direction while creating the
+ * sorted data.
+ */
+ estate->es_direction = ForwardScanDirection;
+
+ /*
+ * Initialize tuplesort module.
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "calling tuplesort_begin");
+
+ outerNode = outerPlanState(node);
+ tupDesc = ExecGetResultType(outerNode);
+
+ if (node->tuplesortstate == NULL)
+ {
+ /*
+ * We are going to process the first group of presorted data.
+ * Initialize the support structures that cmpSortSkipCols() uses to
+ * compare the already-sorted columns.
+ */
+ prepareSkipCols(node);
+
+ /*
+ * Only pass on the remaining columns that are unsorted. Skip
+ * abbreviated keys for incremental sort: we are unlikely to have
+ * huge groups with incremental sort, so using abbreviated keys
+ * would likely be a waste of time.
+ */
+ tuplesortstate = tuplesort_begin_heap(
+ tupDesc,
+ plannode->sort.numCols - skipCols,
+ &(plannode->sort.sortColIdx[skipCols]),
+ &(plannode->sort.sortOperators[skipCols]),
+ &(plannode->sort.collations[skipCols]),
+ &(plannode->sort.nullsFirst[skipCols]),
+ work_mem,
+ false,
+ true);
+ node->tuplesortstate = (void *) tuplesortstate;
+ node->groupsCount++;
+ }
+ else
+ {
+ /* Next group of presorted data */
+ tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ node->groupsCount++;
+ }
+
+ /* Calculate remaining bound for bounded sort */
+ if (node->bounded)
+ tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+ /*
+ * Pass the next group of tuples, in which all skipCols sort values
+ * are equal, to tuplesort.
+ */
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ /* Put next group of presorted data to the tuplesort */
+ if (TupIsNull(node->prevSlot))
+ {
+ /* First tuple */
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+ else
+ {
+ ExecCopySlot(node->prevSlot, slot);
+ }
+ }
+ else
+ {
+ /* Put previous tuple into tuplesort */
+ tuplesort_puttupleslot(tuplesortstate, node->prevSlot);
+ nTuples++;
+
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+ else
+ {
+ bool cmp;
+ cmp = cmpSortSkipCols(node, node->prevSlot, slot);
+
+ /* Replace previous tuple with current one */
+ ExecCopySlot(node->prevSlot, slot);
+
+ /*
+ * When the skipCols values are not equal, the group of
+ * presorted data is finished
+ */
+ if (!cmp)
+ break;
+ }
+ }
+ }
+
+ /*
+ * Complete the sort.
+ */
+ tuplesort_performsort(tuplesortstate);
+
+ /*
+ * restore to user specified direction
+ */
+ estate->es_direction = dir;
+
+ /*
+ * finally set the sorted flag to true
+ */
+ node->sort_Done = true;
+ node->bounded_Done = node->bounded;
+
+ /*
+ * Adjust bound_Done with number of tuples we've actually sorted.
+ */
+ if (node->bounded)
+ {
+ if (node->finished)
+ node->bound_Done = node->bound;
+ else
+ node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ }
+
+ SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "retrieving tuple from tuplesort");
+
+ /*
+ * Get the first or next tuple from tuplesort. Returns NULL if no more
+ * tuples.
+ */
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ slot, NULL);
+ return slot;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecInitIncrementalSort
+ *
+ * Creates the run-time state information for the sort node
+ * produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ IncrementalSortState *incrsortstate;
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "initializing sort node");
+
+ /*
+ * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only hold the
+ * current group in tuplesortstate.
+ */
+ Assert((eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) == 0);
+
+ /*
+ * create state structure
+ */
+ incrsortstate = makeNode(IncrementalSortState);
+ incrsortstate->ss.ps.plan = (Plan *) node;
+ incrsortstate->ss.ps.state = estate;
+
+ incrsortstate->bounded = false;
+ incrsortstate->sort_Done = false;
+ incrsortstate->finished = false;
+ incrsortstate->tuplesortstate = NULL;
+ incrsortstate->prevSlot = NULL;
+ incrsortstate->bound_Done = 0;
+ incrsortstate->groupsCount = 0;
+ incrsortstate->skipKeys = NULL;
+
+ /*
+ * Miscellaneous initialization
+ *
+ * Sort nodes don't initialize their ExprContexts because they never call
+ * ExecQual or ExecProject.
+ */
+
+ /*
+ * tuple table initialization
+ *
+ * sort nodes only return scan tuples from their sorted relation.
+ */
+ ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+
+ /*
+ * initialize child nodes
+ *
+ * We shield the child node from the need to support REWIND, BACKWARD, or
+ * MARK/RESTORE.
+ */
+ eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+ outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+ /*
+ * initialize tuple type. no need to initialize projection info because
+ * this node doesn't do projections.
+ */
+ ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+ /* make standalone slot to store previous tuple from outer node */
+ incrsortstate->prevSlot = MakeSingleTupleTableSlot(
+ ExecGetResultType(outerPlanState(incrsortstate)));
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "sort node initialized");
+
+ return incrsortstate;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "shutting down sort node");
+
+ /*
+ * clean out the tuple table
+ */
+ ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ /* must drop standalone tuple slot from outer node */
+ ExecDropSingleTupleTableSlot(node->prevSlot);
+
+ /*
+ * Release tuplesort resources
+ */
+ if (node->tuplesortstate != NULL)
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+
+ /*
+ * shut down the subplan
+ */
+ ExecEndNode(outerPlanState(node));
+
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "sort node shutdown");
+ }
+
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ PlanState *outerPlan = outerPlanState(node);
+
+ /*
+ * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ * re-scan it at all.
+ */
+ if (!node->sort_Done)
+ return;
+
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+ /*
+ * Incremental sort doesn't support randomAccess, so we must always
+ * forget the previous sort results, re-read the subplan, and re-sort.
+ */
+ node->sort_Done = false;
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+ node->bound_Done = 0;
+
+ /*
+ * if chgParam of subnode is not null then plan will be re-scanned by
+ * first ExecProcNode.
+ */
+ if (outerPlan->chgParam == NULL)
+ ExecReScan(outerPlan);
+ }
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 591a31a..cf228d6
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(SortState *node)
*** 89,95 ****
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
--- 89,96 ----
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess,
! false);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index c23d5c5..be3748d
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 889,894 ****
--- 889,912 ----
/*
+ * CopySortFields
+ *
+ * This function copies the fields of the Sort node. It is used by
+ * all the copy functions for classes which inherit from Sort.
+ */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+ COPY_SCALAR_FIELD(numCols);
+ COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+
+ /*
* _copySort
*/
static Sort *
*************** _copySort(const Sort *from)
*** 899,911 ****
/*
* copy node superclass fields
*/
! CopyPlanFields((const Plan *) from, (Plan *) newnode);
! COPY_SCALAR_FIELD(numCols);
! COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
return newnode;
}
--- 917,945 ----
/*
* copy node superclass fields
*/
! CopySortFields(from, newnode);
! return newnode;
! }
!
!
! /*
! * _copyIncrementalSort
! */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! IncrementalSort *newnode = makeNode(IncrementalSort);
!
! /*
! * copy node superclass fields
! */
! CopySortFields((const Sort *) from, (Sort *) newnode);
!
! /*
! * copy remainder of node
! */
! COPY_SCALAR_FIELD(skipCols);
return newnode;
}
*************** copyObject(const void *from)
*** 4733,4738 ****
--- 4767,4775 ----
case T_Sort:
retval = _copySort(from);
break;
+ case T_IncrementalSort:
+ retval = _copyIncrementalSort(from);
+ break;
case T_Group:
retval = _copyGroup(from);
break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index bbb63a4..7dfa56f
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 826,837 ****
}
static void
! _outSort(StringInfo str, const Sort *node)
{
int i;
- WRITE_NODE_TYPE("SORT");
-
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
--- 826,835 ----
}
static void
! _outSortInfo(StringInfo str, const Sort *node)
{
int i;
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 854,859 ****
--- 852,875 ----
}
static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ WRITE_NODE_TYPE("SORT");
+
+ _outSortInfo(str, node);
+ }
+
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ WRITE_NODE_TYPE("INCREMENTALSORT");
+
+ _outSortInfo(str, (const Sort *) node);
+
+ WRITE_INT_FIELD(skipCols);
+ }
+
+ static void
_outUnique(StringInfo str, const Unique *node)
{
int i;
*************** outNode(StringInfo str, const void *obj)
*** 3656,3661 ****
--- 3672,3680 ----
case T_Sort:
_outSort(str, obj);
break;
+ case T_IncrementalSort:
+ _outIncrementalSort(str, obj);
+ break;
case T_Unique:
_outUnique(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index 474f221..40b712e
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 2025,2036 ****
}
/*
! * _readSort
*/
! static Sort *
! _readSort(void)
{
! READ_LOCALS(Sort);
ReadCommonPlan(&local_node->plan);
--- 2025,2037 ----
}
/*
! * ReadCommonSort
! * Assign the basic stuff of all nodes that inherit from Sort
*/
! static void
! ReadCommonSort(Sort *local_node)
{
! READ_TEMP_LOCALS();
ReadCommonPlan(&local_node->plan);
*************** _readSort(void)
*** 2039,2044 ****
--- 2040,2071 ----
READ_OID_ARRAY(sortOperators, local_node->numCols);
READ_OID_ARRAY(collations, local_node->numCols);
READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+
+ /*
+ * _readSort
+ */
+ static Sort *
+ _readSort(void)
+ {
+ READ_LOCALS_NO_FIELDS(Sort);
+
+ ReadCommonSort(local_node);
+
+ READ_DONE();
+ }
+
+ /*
+ * _readIncrementalSort
+ */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ READ_LOCALS(IncrementalSort);
+
+ ReadCommonSort(&local_node->sort);
+
+ READ_INT_FIELD(skipCols);
READ_DONE();
}
*************** parseNodeString(void)
*** 2591,2596 ****
--- 2618,2625 ----
return_value = _readMaterial();
else if (MATCH("SORT", 4))
return_value = _readSort();
+ else if (MATCH("INCREMENTALSORT", 15))
+ return_value = _readIncrementalSort();
else if (MATCH("GROUP", 5))
return_value = _readGroup();
else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index a1e1a87..aca363d
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3234,3239 ****
--- 3234,3243 ----
ptype = "Sort";
subpath = ((SortPath *) path)->subpath;
break;
+ case T_IncrementalSortPath:
+ ptype = "IncrementalSort";
+ subpath = ((SortPath *) path)->subpath;
+ break;
case T_GroupPath:
ptype = "Group";
subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 92de2b7..50f4502
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
+ bool enable_incrementalsort = true;
bool enable_hashagg = true;
bool enable_nestloop = true;
bool enable_material = true;
*************** cost_recursive_union(Path *runion, Path
*** 1563,1568 ****
--- 1564,1576 ----
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
+ * Sort can be either a full sort of the relation or an incremental sort when
+ * the data is already presorted by a prefix of the required pathkeys. In the
+ * latter case we estimate the number of groups the source data is divided
+ * into by the presorted pathkeys, and then estimate the cost of sorting each
+ * individual group, assuming the data is divided into groups uniformly.
+ * Also, if a LIMIT is specified, we only have to pull from the source and
+ * sort some of the total groups.
+ *
* If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path
*** 1589,1595 ****
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
--- 1597,1604 ----
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_startup_cost' is the startup cost for reading the input data
! * 'input_total_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path
*** 1605,1623 ****
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_cost;
! Cost run_cost = 0;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
path->rows = tuples;
--- 1614,1641 ----
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_startup_cost;
! Cost run_cost = 0,
! rest_cost,
! group_cost,
! input_run_cost = input_total_cost - input_startup_cost;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
+ double num_groups,
+ group_input_bytes,
+ group_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
+ if (!enable_incrementalsort)
+ presorted_keys = 0;
path->rows = tuples;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1643,1655 ****
output_bytes = input_bytes;
}
! if (output_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(input_bytes / BLCKSZ);
! double nruns = input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
--- 1661,1710 ----
output_bytes = input_bytes;
}
! /*
! * Estimate the number of groups the dataset is divided into by the
! * presorted keys.
! */
! if (presorted_keys > 0)
! {
! List *presortedExprs = NIL;
! ListCell *l;
! int i = 0;
!
! /* Extract presorted keys as list of expressions */
! foreach(l, pathkeys)
! {
! PathKey *key = (PathKey *)lfirst(l);
! EquivalenceMember *member = (EquivalenceMember *)
! lfirst(list_head(key->pk_eclass->ec_members));
!
! presortedExprs = lappend(presortedExprs, member->em_expr);
!
! i++;
! if (i >= presorted_keys)
! break;
! }
!
! /* Estimate number of groups with equal presorted keys */
! num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! }
! else
! {
! num_groups = 1.0;
! }
!
! /*
! * Estimate the average cost of sorting one group of tuples whose
! * presorted keys are equal.
! */
! group_input_bytes = input_bytes / num_groups;
! group_tuples = tuples / num_groups;
! if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(group_input_bytes / BLCKSZ);
! double nruns = group_input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1659,1665 ****
*
* Assume about N log2 N comparisons
*/
! startup_cost += comparison_cost * tuples * LOG2(tuples);
/* Disk costs */
--- 1714,1720 ----
*
* Assume about N log2 N comparisons
*/
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
/* Disk costs */
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1670,1679 ****
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! startup_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1725,1734 ----
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! group_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1681,1694 ****
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
}
else
{
! /* We'll use plain quicksort on all the input tuples */
! startup_cost += comparison_cost * tuples * LOG2(tuples);
}
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
--- 1736,1768 ----
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
}
else
{
! /*
! * We'll use plain quicksort on all the input tuples. If we expect
! * fewer than two tuples per sort group, assume the logarithmic part
! * of the estimate to be 1.
! */
! if (group_tuples >= 2.0)
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
! else
! group_cost = comparison_cost * group_tuples;
}
+ /* Add the per-group cost of fetching tuples from the input */
+ group_cost += input_run_cost / num_groups;
+
+ /*
+ * We have to sort the first group before the node can start returning
+ * tuples. Sorting the remaining groups is required to return all the
+ * other tuples.
+ */
+ startup_cost += group_cost;
+ rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ if (rest_cost > 0.0)
+ run_cost += rest_cost;
+
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1699,1704 ****
--- 1773,1791 ----
*/
run_cost += cpu_operator_cost * tuples;
+ /* Extra costs of incremental sort */
+ if (presorted_keys > 0)
+ {
+ /*
+ * In the incremental sort case we also have to pay the cost of
+ * detecting sort group boundaries. That amounts to an extra copy and
+ * comparison for each tuple.
+ */
+ run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+ /* Cost of the per-group tuplesort reset */
+ run_cost += 10.0 * cpu_tuple_cost * num_groups;
+ }
+
path->startup_cost = startup_cost;
path->total_cost = startup_cost + run_cost;
}
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2452,2457 ****
--- 2539,2546 ----
cost_sort(&sort_path,
root,
outersortkeys,
+ pathkeys_common(outer_path->pathkeys, outersortkeys),
+ outer_path->startup_cost,
outer_path->total_cost,
outer_path_rows,
outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2478,2483 ****
--- 2567,2574 ----
cost_sort(&sort_path,
root,
innersortkeys,
+ pathkeys_common(inner_path->pathkeys, innersortkeys),
+ inner_path->startup_cost,
inner_path->total_cost,
inner_path_rows,
inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 2c26906..2da6f40
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
#include "nodes/nodeFuncs.h"
#include "nodes/plannodes.h"
#include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/tlist.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
return PATHKEYS_EQUAL;
}
+
+ /*
+ * pathkeys_common
+ * Returns the length of the longest common prefix of keys1 and keys2.
+ */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ int n;
+ ListCell *key1,
+ *key2;
+ n = 0;
+
+ forboth(key1, keys1, key2, keys2)
+ {
+ PathKey *pathkey1 = (PathKey *) lfirst(key1);
+ PathKey *pathkey2 = (PathKey *) lfirst(key2);
+
+ if (pathkey1 != pathkey2)
+ return n;
+ n++;
+ }
+
+ return n;
+ }
+
+
/*
* pathkeys_contained_in
* Common special case of compare_pathkeys: we just want to know
*************** get_cheapest_path_for_pathkeys(List *pat
*** 373,380 ****
/*
* get_cheapest_fractional_path_for_pathkeys
* Find the cheapest path (for retrieving a specified fraction of all
! * the tuples) that satisfies the given pathkeys and parameterization.
! * Return NULL if no such path.
*
* See compare_fractional_path_costs() for the interpretation of the fraction
* parameter.
--- 402,413 ----
/*
* get_cheapest_fractional_path_for_pathkeys
* Find the cheapest path (for retrieving a specified fraction of all
! * the tuples) that satisfies the given parameterization and at least
! * partially satisfies the given pathkeys. Return NULL if no such path.
! * If the pathkeys are satisfied only partially, we will have to perform
! * an incremental sort to satisfy them completely. Since incremental
! * sort consumes its input by presorted groups, it has to consume more
! * data than a fully presorted path would.
*
* See compare_fractional_path_costs() for the interpretation of the fraction
* parameter.
*************** right_merge_direction(PlannerInfo *root,
*** 1488,1513 ****
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
! * no good to order by just the first key(s) of the requested ordering.
! * So the result is always either 0 or list_length(root->query_pathkeys).
*/
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
{
! if (root->query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
{
! /* It's useful ... or at least the first N keys are */
! return list_length(root->query_pathkeys);
}
-
- return 0; /* path ordering not useful */
}
/*
--- 1521,1562 ----
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Returns the number of pathkeys that match the given query_pathkeys.
! * The remainder can be satisfied by an incremental sort.
*/
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
{
! int n_common_pathkeys;
!
! if (query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
!
! if (enable_incrementalsort)
{
! /*
! * Return the number of pathkeys in common, or 0 if there are none. Any
! * common prefix of pathkeys is useful for ordering because incremental
! * sort can handle the remaining keys.
! */
! return n_common_pathkeys;
! }
! else
! {
! /*
! * When incremental sort is disabled, pathkeys are useful only when
! * they contain all the query pathkeys.
! */
! if (n_common_pathkeys == list_length(query_pathkeys))
! return n_common_pathkeys;
! else
! return 0;
}
}
/*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1523,1529 ****
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
--- 1572,1578 ----
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index aafec58..9535622
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 232,238 ****
bool *mergenullsfirst,
Plan *lefttree, Plan *righttree,
JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 232,238 ----
bool *mergenullsfirst,
Plan *lefttree, Plan *righttree,
JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static Plan *prepare_sort_from_pathkeys(
*** 247,256 ****
static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 247,258 ----
static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! int skipCols);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 431,436 ****
--- 433,439 ----
(GatherPath *) best_path);
break;
case T_Sort:
+ case T_IncrementalSort:
plan = (Plan *) create_sort_plan(root,
(SortPath *) best_path,
flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1087,1092 ****
--- 1090,1096 ----
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int n_common_pathkeys;
/* Build the child plan */
/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1121,1129 ****
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
--- 1125,1135 ----
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
+ n_common_pathkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1471,1476 ****
--- 1477,1483 ----
Plan *subplan;
List *pathkeys = best_path->path.pathkeys;
List *tlist = build_path_tlist(root, &best_path->path);
+ int n_common_pathkeys;
/* As with Gather, it's best to project away columns in the workers. */
subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1497,1508 ****
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
--- 1504,1519 ----
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
! {
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ n_common_pathkeys,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
+ }
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1610,1615 ****
--- 1621,1627 ----
{
Sort *plan;
Plan *subplan;
+ int n_common_pathkeys;
/*
* We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1619,1625 ****
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
--- 1631,1641 ----
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! best_path->subpath->pathkeys);
!
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! n_common_pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1863,1869 ****
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
! subplan);
}
if (!rollup->is_hashed)
--- 1879,1886 ----
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
! subplan,
! 0);
}
if (!rollup->is_hashed)
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3755,3762 ****
*/
if (best_path->outersortkeys)
{
! Sort *sort = make_sort_from_pathkeys(outer_plan,
! best_path->outersortkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
--- 3772,3785 ----
*/
if (best_path->outersortkeys)
{
! Sort *sort;
! int n_common_pathkeys;
!
! n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! best_path->jpath.outerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3767,3774 ****
if (best_path->innersortkeys)
{
! Sort *sort = make_sort_from_pathkeys(inner_plan,
! best_path->innersortkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
--- 3790,3803 ----
if (best_path->innersortkeys)
{
! Sort *sort;
! int n_common_pathkeys;
!
! n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! best_path->jpath.innerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4820,4826 ****
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
--- 4849,4856 ----
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL, 0,
! lefttree->startup_cost,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5380,5392 ****
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node = makeNode(Sort);
! Plan *plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
--- 5410,5440 ----
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node;
! Plan *plan;
+ /* Always use a regular Sort node when enable_incrementalsort is false */
+ if (!enable_incrementalsort)
+ skipCols = 0;
+
+ if (skipCols == 0)
+ {
+ node = makeNode(Sort);
+ }
+ else
+ {
+ IncrementalSort *incrementalSort;
+
+ incrementalSort = makeNode(IncrementalSort);
+ node = &incrementalSort->sort;
+ incrementalSort->skipCols = skipCols;
+ }
+
+ plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass
*** 5718,5724 ****
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys)
{
int numsortkeys;
AttrNumber *sortColIdx;
--- 5766,5772 ----
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, int skipCols)
{
int numsortkeys;
AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree,
*** 5738,5744 ****
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5786,5792 ----
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5781,5787 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5829,5835 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, 0,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5802,5808 ****
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
--- 5850,5857 ----
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5835,5841 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5884,5890 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** is_projection_capable_plan(Plan *plan)
*** 6484,6489 ****
--- 6533,6539 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index c3fbf3c..5fe1235
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
#include "parser/parse_clause.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
#include "utils/syscache.h"
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index fa7a5f8..33fd370
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3751,3764 ****
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_partial_path || is_sorted)
{
/* Sort the cheapest partial path, if it isn't already */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3751,3764 ----
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_partial_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest partial path, if it isn't already */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3831,3844 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_path || is_sorted)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3831,3844 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4905,4917 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || is_sorted)
{
! if (!is_sorted)
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
--- 4905,4917 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || n_useful_pathkeys > 0)
{
! if (n_useful_pathkeys < list_length(root->sort_pathkeys))
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid
*** 6040,6047 ****
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL,
! seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
--- 6040,6048 ----
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL, 0,
! seqScanPath->startup_cost, seqScanPath->total_cost,
! rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index 5930747..01d1328
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 623,628 ****
--- 623,629 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 6fa6540..2b7f081
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2698,2703 ****
--- 2698,2704 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_Gather:
case T_GatherMerge:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index d88738e..9ae0c88
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 963,969 ****
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 963,970 ----
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, 0,
! sorted_p.startup_cost, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 999ebce..9769a5c
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
}
/*
! * compare_path_fractional_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
--- 95,101 ----
}
/*
! * compare_fractional_path_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1297,1308 ****
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1297,1309 ----
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ int n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1316,1321 ****
--- 1317,1324 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->parent->tuples,
subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1552,1558 ****
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
--- 1555,1562 ----
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL, 0,
! subpath->startup_cost,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
*************** create_gather_merge_path(PlannerInfo *ro
*** 1644,1649 ****
--- 1648,1654 ----
GatherMergePath *pathnode = makeNode(GatherMergePath);
Cost input_startup_cost = 0;
Cost input_total_cost = 0;
+ int n_common_pathkeys;
Assert(subpath->parallel_safe);
Assert(pathkeys);
*************** create_gather_merge_path(PlannerInfo *ro
*** 1660,1666 ****
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1665,1673 ----
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
!
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_gather_merge_path(PlannerInfo *ro
*** 1674,1679 ****
--- 1681,1688 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2489,2497 ****
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode = makeNode(SortPath);
- pathnode->path.pathtype = T_Sort;
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
--- 2498,2528 ----
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode;
! int n_common_pathkeys;
!
! if (enable_incrementalsort)
! n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! else
! n_common_pathkeys = 0;
!
! if (n_common_pathkeys == 0)
! {
! pathnode = makeNode(SortPath);
! pathnode->path.pathtype = T_Sort;
! }
! else
! {
! IncrementalSortPath *incpathnode;
!
! incpathnode = makeNode(IncrementalSortPath);
! pathnode = &incpathnode->spath;
! pathnode->path.pathtype = T_IncrementalSort;
! incpathnode->skipCols = n_common_pathkeys;
! }
!
! Assert(n_common_pathkeys < list_length(pathkeys));
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2505,2511 ****
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root, pathkeys,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
--- 2536,2544 ----
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root,
! pathkeys, n_common_pathkeys,
! subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2813,2819 ****
else
{
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL,
0.0,
subpath->rows,
subpath->pathtarget->width,
--- 2846,2853 ----
else
{
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL, 0,
! 0.0,
0.0,
subpath->rows,
subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index e462fbd..fb54f27
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 277,283 ****
qstate->sortOperators,
qstate->sortCollations,
qstate->sortNullsFirsts,
! work_mem, false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
--- 277,283 ----
qstate->sortOperators,
qstate->sortCollations,
qstate->sortNullsFirsts,
! work_mem, false, false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index 5c382a2..6426e44
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3567,3572 ****
--- 3567,3608 ----
}
/*
+ * estimate_pathkeys_groups - Estimate the number of groups the dataset is
+ * divided into by the given pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * the first i pathkeys divide the dataset into.  This is a convenience
+ * wrapper over estimate_num_groups().
+ */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ ListCell *l;
+ List *groupExprs = NIL;
+ double *result;
+ int i;
+
+ /*
+ * Get number of groups for each prefix of pathkeys.
+ */
+ i = 0;
+ result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ groupExprs = lappend(groupExprs, member->em_expr);
+
+ result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ i++;
+ }
+
+ return result;
+ }
+
+ /*
* Estimate hash bucketsize fraction (ie, number of entries in a bucket
* divided by total tuples in relation) if the specified expression is used
* as a hash key.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index e9d561b..1e8572d
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 858,863 ****
--- 858,872 ----
NULL, NULL, NULL
},
{
+ {"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of incremental sort steps."),
+ NULL
+ },
+ &enable_incrementalsort,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of hashed aggregation plans."),
NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index e1e692d..ed189c2
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 281,286 ****
--- 281,293 ----
int64 allowedMem; /* total memory allowed, in bytes */
int maxTapes; /* number of tapes (Knuth's T) */
int tapeRange; /* maxTapes-1 (Knuth's P) */
+ int64 maxSpace; /* maximum amount of space occupied among sort
+ of groups, either in-memory or on-disk */
+ bool maxSpaceOnDisk; /* true when maxSpace is value for on-disk
space, false when it's value for in-memory
+ space */
+ TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ MemoryContext maincontext;
MemoryContext sortcontext; /* memory context holding most sort data */
MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 633,638 ****
--- 640,648 ----
static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+
/*
* Special versions of qsort just for SortTuple objects. qsort_tuple() sorts
*************** static Tuplesortstate *
*** 667,685 ****
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Create a working memory context for this sort operation. All data
! * needed by the sort will live inside this context.
*/
! sortcontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
--- 677,706 ----
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
+ MemoryContext maincontext;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Memory context surviving tuplesort_reset. This memory context holds
! * data which is useful to keep while sorting multiple similar batches.
*/
! maincontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
+ * Create a working memory context for one sort operation. The content of
+ * this context is deleted by tuplesort_reset.
+ */
+ sortcontext = AllocSetContextCreate(maincontext,
+ "TupleSort sort",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+
+ /*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 696,702 ****
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(sortcontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
--- 717,723 ----
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(maincontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
*************** tuplesort_begin_common(int workMem, bool
*** 714,719 ****
--- 735,741 ----
state->availMem = state->allowedMem;
state->sortcontext = sortcontext;
state->tuplecontext = tuplecontext;
+ state->maincontext = maincontext;
state->tapeset = NULL;
state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 754,766 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
AssertArg(nkeys > 0);
--- 776,789 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
AssertArg(nkeys > 0);
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 802,808 ****
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0);
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
--- 825,831 ----
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0) && !skipAbbrev;
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 833,839 ****
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 856,862 ----
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 924,930 ****
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 947,953 ----
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 997,1003 ****
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 1020,1026 ----
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 1034,1040 ****
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 1057,1063 ----
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1145,1160 ****
}
/*
! * tuplesort_end
! *
! * Release resources and clean up.
*
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
*/
! void
! tuplesort_end(Tuplesortstate *state)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1168,1179 ----
}
/*
! * tuplesort_free
*
! * Internal routine for freeing resources of tuplesort.
*/
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1213,1219 ****
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! MemoryContextDelete(state->sortcontext);
}
/*
--- 1232,1329 ----
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! if (delete)
! {
! MemoryContextDelete(state->maincontext);
! }
! else
! {
! MemoryContextResetOnly(state->sortcontext);
! MemoryContextResetOnly(state->tuplecontext);
! }
! }
!
! /*
! * tuplesort_end
! *
! * Release resources and clean up.
! *
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
! */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! tuplesort_free(state, true);
! }
!
! /*
! * tuplesort_updatemax
! *
! * Update maximum resource usage statistics.
! */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! int64 spaceUsed;
! bool spaceUsedOnDisk;
!
! /*
! * Note: it might seem we should provide both memory and disk usage for a
! * disk-based sort. However, the current code doesn't track memory space
! * accurately once we have begun to return tuples to the caller (since we
! * don't account for pfree's the caller is expected to do), so we cannot
! * rely on availMem in a disk sort. This does not seem worth the overhead
! * to fix. Is it worth creating an API for the memory context code to
! * tell us how much is actually used in sortcontext?
! */
! if (state->tapeset)
! {
! spaceUsedOnDisk = true;
! spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! }
! else
! {
! spaceUsedOnDisk = false;
! spaceUsed = state->allowedMem - state->availMem;
! }
!
! if (spaceUsed > state->maxSpace)
! {
! state->maxSpace = spaceUsed;
! state->maxSpaceOnDisk = spaceUsedOnDisk;
! state->maxSpaceStatus = state->status;
! }
! }
!
! /*
! * tuplesort_reset
! *
! * Reset the tuplesort. Reset all the data in the tuplesort, but leave the
! * meta-information in. After tuplesort_reset, tuplesort is ready to start
! * a new sort. This allows avoiding recreation of the tuplesort (and saving
! * resources) when sorting multiple small batches.
! */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! tuplesort_updatemax(state);
! tuplesort_free(state, false);
! state->status = TSS_INITIAL;
! state->memtupcount = 0;
! state->boundUsed = false;
! state->tapeset = NULL;
! state->currentRun = 0;
! state->result_tape = -1;
! state->bounded = false;
! state->availMem = state->allowedMem;
! state->lastReturnedTuple = NULL;
! state->slabAllocatorUsed = false;
! state->slabMemoryBegin = NULL;
! state->slabMemoryEnd = NULL;
! state->slabFreeHead = NULL;
! USEMEM(state, GetMemoryChunkSpace(state->memtuples));
}
/*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 3219,3245 ****
const char **spaceType,
long *spaceUsed)
{
! /*
! * Note: it might seem we should provide both memory and disk usage for a
! * disk-based sort. However, the current code doesn't track memory space
! * accurately once we have begun to return tuples to the caller (since we
! * don't account for pfree's the caller is expected to do), so we cannot
! * rely on availMem in a disk sort. This does not seem worth the overhead
! * to fix. Is it worth creating an API for the memory context code to
! * tell us how much is actually used in sortcontext?
! */
! if (state->tapeset)
! {
*spaceType = "Disk";
- *spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- }
else
- {
*spaceType = "Memory";
! *spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! }
! switch (state->status)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
--- 3329,3343 ----
const char **spaceType,
long *spaceUsed)
{
! tuplesort_updatemax(state);
!
! if (state->maxSpaceOnDisk)
*spaceType = "Disk";
else
*spaceType = "Memory";
! *spaceUsed = (state->maxSpace + 1023) / 1024;
! switch (state->maxSpaceStatus)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...09c5a27
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,25 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+
+ #include "nodes/execnodes.h"
+
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node,
+ EState *estate, int eflags);
+ extern TupleTableSlot *ExecIncrementalSort(IncrementalSortState *node);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+ #endif /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index 11a6850..06184f4
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1655,1660 ****
--- 1655,1674 ----
Tuplestorestate *tuplestorestate;
} MaterialState;
+
+ /* ----------------
+ * When sorting by multiple keys, the input dataset may already be
+ * presorted by some prefix of those keys. We call these "skip keys".
+ * SkipKeyData represents information about one such key.
+ * ----------------
+ */
+ typedef struct SkipKeyData
+ {
+ FmgrInfo flinfo; /* comparison function info */
+ FunctionCallInfoData fcinfo; /* comparison function call info */
+ OffsetNumber attno; /* attribute number in tuple */
+ } SkipKeyData;
+
/* ----------------
* SortState information
* ----------------
*************** typedef struct SortState
*** 1671,1676 ****
--- 1685,1710 ----
void *tuplesortstate; /* private state of tuplesort.c */
} SortState;
+ /* ----------------
+ * IncrementalSortState information
+ * ----------------
+ */
+ typedef struct IncrementalSortState
+ {
+ ScanState ss; /* its first field is NodeTag */
+ bool bounded; /* is the result set bounded? */
+ int64 bound; /* if bounded, how many tuples are needed */
+ bool sort_Done; /* sort completed yet? */
+ bool finished; /* fetching tuples from outer node
is finished? */
+ bool bounded_Done; /* value of bounded we did the sort with */
+ int64 bound_Done; /* value of bound we did the sort with */
+ void *tuplesortstate; /* private state of tuplesort.c */
+ SkipKeyData *skipKeys; /* keys the dataset is presorted by */
+ int64 groupsCount; /* number of groups with equal skip keys */
+ TupleTableSlot *prevSlot; /* slot for previous tuple from outer node */
+ } IncrementalSortState;
+
/* ---------------------
* GroupState information
* ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index b9369ac..e550f26
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 72,77 ****
--- 72,78 ----
T_HashJoin,
T_Material,
T_Sort,
+ T_IncrementalSort,
T_Group,
T_Agg,
T_WindowAgg,
*************** typedef enum NodeTag
*** 123,128 ****
--- 124,130 ----
T_HashJoinState,
T_MaterialState,
T_SortState,
+ T_IncrementalSortState,
T_GroupState,
T_AggState,
T_WindowAggState,
*************** typedef enum NodeTag
*** 237,242 ****
--- 239,245 ----
T_ProjectionPath,
T_ProjectSetPath,
T_SortPath,
+ T_IncrementalSortPath,
T_GroupPath,
T_UpperUniquePath,
T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index 6e531b6..4959f95
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 720,725 ****
--- 720,736 ----
bool *nullsFirst; /* NULLS FIRST/LAST directions */
} Sort;
+
+ /* ----------------
+ * incremental sort node
+ * ----------------
+ */
+ typedef struct IncrementalSort
+ {
+ Sort sort;
+ int skipCols; /* number of presorted columns */
+ } IncrementalSort;
+
/* ---------------
* group node -
* Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index 8930edf..4bf6f3a
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1371,1376 ****
--- 1371,1386 ----
} SortPath;
/*
+ * IncrementalSortPath
+ */
+ typedef struct IncrementalSortPath
+ {
+ SortPath spath;
+ int skipCols;
+ } IncrementalSortPath;
+
+
+ /*
* GroupPath represents grouping (of presorted input)
*
* groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index d9a9b12..06827e3
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
extern bool enable_bitmapscan;
extern bool enable_tidscan;
extern bool enable_sort;
+ extern bool enable_incrementalsort;
extern bool enable_hashagg;
extern bool enable_nestloop;
extern bool enable_material;
*************** extern void cost_ctescan(Path *path, Pla
*** 100,107 ****
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
--- 101,109 ----
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 25fe78c..01073dd
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 182,187 ****
--- 182,188 ----
extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
Relids required_outer,
CostSelector cost_criterion,
*************** extern List *select_outer_pathkeys_for_m
*** 220,225 ****
--- 221,227 ----
extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
List *mergeclauses,
List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
extern List *truncate_useless_pathkeys(PlannerInfo *root,
RelOptInfo *rel,
List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 9f9d2dc..b8884b6
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 204,209 ****
--- 204,212 ----
extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows, List **pgset);
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ double tuples);
+
extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
double nbuckets);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 5b3f475..616f9f5
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 62,68 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
--- 62,69 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 104,109 ****
--- 105,112 ----
extern void tuplesort_end(Tuplesortstate *state);
+ extern void tuplesort_reset(Tuplesortstate *state);
+
extern void tuplesort_get_stats(Tuplesortstate *state,
const char **sortMethod,
const char **spaceType,
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort
*** 19,27 ****
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Sort
Sort Key: id, data
! -> Seq Scan on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
--- 19,28 ----
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Incremental Sort
Sort Key: id, data
! Presorted Key: id
! -> Index Scan using test_dc_pkey on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index 6163ed8..9553648
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE: drop cascades to table matest1
*** 1493,1498 ****
--- 1493,1499 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1633,1641 ****
--- 1634,1678 ----
{3,7,8,10,13,13,16,18,19,22}
(3 rows)
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ QUERY PLAN
+ -------------------------------------------------------------------------
+ Merge Append
+ Sort Key: tenk1.thousand, tenk1.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1
+ -> Incremental Sort
+ Sort Key: tenk1_1.thousand, tenk1_1.thousand
+ Presorted Key: tenk1_1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ QUERY PLAN
+ -------------------------------------------------------------
+ Merge Append
+ Sort Key: a.thousand, a.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 a
+ -> Incremental Sort
+ Sort Key: b.unique2, b.unique2
+ Presorted Key: b.unique2
+ -> Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index 568b783..e60fb43
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select count(*) >= 0 as ok from pg_prepa
*** 70,90 ****
-- This is to record the prevailing planner enable_foo settings during
-- a regression test run.
select name, setting from pg_settings where name like 'enable%';
! name | setting
! ----------------------+---------
! enable_bitmapscan | on
! enable_gathermerge | on
! enable_hashagg | on
! enable_hashjoin | on
! enable_indexonlyscan | on
! enable_indexscan | on
! enable_material | on
! enable_mergejoin | on
! enable_nestloop | on
! enable_seqscan | on
! enable_sort | on
! enable_tidscan | on
! (12 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
--- 70,91 ----
-- This is to record the prevailing planner enable_foo settings during
-- a regression test run.
select name, setting from pg_settings where name like 'enable%';
! name | setting
! ------------------------+---------
! enable_bitmapscan | on
! enable_gathermerge | on
! enable_hashagg | on
! enable_hashjoin | on
! enable_incrementalsort | on
! enable_indexonlyscan | on
! enable_indexscan | on
! enable_material | on
! enable_mergejoin | on
! enable_nestloop | on
! enable_seqscan | on
! enable_sort | on
! enable_tidscan | on
! (13 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index d43b75c..ec611f5
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 527,532 ****
--- 527,533 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
*************** SELECT
*** 588,596 ****
--- 589,614 ----
ORDER BY f.i LIMIT 10)
FROM generate_series(1, 3) g(i);
+ set enable_incrementalsort = on;
+
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
On 2017-03-29 00:17:02 +0300, Alexander Korotkov wrote:
On Tue, Mar 28, 2017 at 5:27 PM, David Steele <david@pgmasters.net> wrote:
Hi Alexander,
On 3/20/17 10:19 AM, Heikki Linnakangas wrote:
On 03/20/2017 11:33 AM, Alexander Korotkov wrote:
Please, find rebased patch in the attachment.
I had a quick look at this.
<...>
According to 'perf', 85% of the CPU time is spent in ExecCopySlot(). To
alleviate that, it might be worthwhile to add a special case for when
the group contains exactly one group, and not put the tuple to the
tuplesort in that case. Or if we cannot ensure that the Incremental Sort
is actually faster, the cost model should probably be smarter, to avoid
picking an incremental sort when it's not a win.
This thread has been idle for over a week. Please respond with a new
patch by 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked
"Returned with Feedback".
Thank you for reminder!
I've just done so. Please resubmit once updated, it's a cool feature.
- Andres
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Apr 3, 2017 at 9:34 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-03-29 00:17:02 +0300, Alexander Korotkov wrote:
On Tue, Mar 28, 2017 at 5:27 PM, David Steele <david@pgmasters.net>
wrote:
On 3/20/17 10:19 AM, Heikki Linnakangas wrote:
On 03/20/2017 11:33 AM, Alexander Korotkov wrote:
Please, find rebased patch in the attachment.
I had a quick look at this.
<...>
According to 'perf', 85% of the CPU time is spent in ExecCopySlot(). To
alleviate that, it might be worthwhile to add a special case for when
the group contains exactly one group, and not put the tuple to the
tuplesort in that case. Or if we cannot ensure that the IncrementalSort
is actually faster, the cost model should probably be smarter, to
avoid
picking an incremental sort when it's not a win.
This thread has been idle for over a week. Please respond with a new
patch by 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked
"Returned with Feedback".
Thank you for reminder!
I've just done so. Please resubmit once updated, it's a cool feature.
Thank you!
I already sent version of patch after David's reminder.
Please find rebased patch in the attachment.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
incremental-sort-5.patchapplication/octet-stream; name=incremental-sort-5.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index 1a9e6c8..c27b63e
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1913,1951 ****
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ---------------------------------------------------------------------
Limit
! Output: t1.c1, t2.c1
-> Sort
! Output: t1.c1, t2.c1
! Sort Key: t1.c1, t2.c1
-> Nested Loop
! Output: t1.c1, t2.c1
-> Foreign Scan on public.ft1 t1
! Output: t1.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-> Materialize
! Output: t2.c1
-> Foreign Scan on public.ft2 t2
! Output: t2.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! c1 | c1
! ----+-----
! 1 | 101
! 1 | 102
! 1 | 103
! 1 | 104
! 1 | 105
! 1 | 106
! 1 | 107
! 1 | 108
! 1 | 109
! 1 | 110
(10 rows)
-- different server, not pushed down. No result expected.
--- 1913,1951 ----
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ------------------------------------------------------------------
Limit
! Output: t1.c3, t2.c3
-> Sort
! Output: t1.c3, t2.c3
! Sort Key: t1.c3, t2.c3
-> Nested Loop
! Output: t1.c3, t2.c3
-> Foreign Scan on public.ft1 t1
! Output: t1.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
-> Materialize
! Output: t2.c3
-> Foreign Scan on public.ft2 t2
! Output: t2.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! c3 | c3
! -------+-------
! 00001 | 00101
! 00001 | 00102
! 00001 | 00103
! 00001 | 00104
! 00001 | 00105
! 00001 | 00106
! 00001 | 00107
! 00001 | 00108
! 00001 | 00109
! 00001 | 00110
(10 rows)
-- different server, not pushed down. No result expected.
*************** select c2/2, sum(c2) * (c2/2) from ft1 g
*** 2487,2504 ****
-- Aggregates in subquery are pushed down.
explain (verbose, costs off)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
! QUERY PLAN
! ----------------------------------------------------------------------------------------------------------
Aggregate
Output: count(ft1.c2), sum(ft1.c2)
! -> Sort
Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
Sort Key: ft1.c2, (sum(ft1.c1))
! -> Foreign Scan
! Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
! Relations: Aggregate on (public.ft1)
! Remote SQL: SELECT c2, sum("C 1"), sqrt("C 1") FROM "S 1"."T 1" GROUP BY c2, (sqrt("C 1"))
! (9 rows)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
count | sum
--- 2487,2507 ----
-- Aggregates in subquery are pushed down.
explain (verbose, costs off)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
! QUERY PLAN
! --------------------------------------------------------------------------------------------------------------------------
Aggregate
Output: count(ft1.c2), sum(ft1.c2)
! -> Incremental Sort
Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
Sort Key: ft1.c2, (sum(ft1.c1))
! Presorted Key: ft1.c2
! -> GroupAggregate
! Output: ft1.c2, sum(ft1.c1), (sqrt((ft1.c1)::double precision))
! Group Key: ft1.c2, sqrt((ft1.c1)::double precision)
! -> Foreign Scan on public.ft1
! Output: ft1.c2, sqrt((ft1.c1)::double precision), ft1.c1
! Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" ORDER BY c2 ASC NULLS LAST, sqrt("C 1") ASC NULLS LAST
! (12 rows)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
count | sum
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index cf70ca2..94e0b3d
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 479,486 ****
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 479,486 ----
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index ac339fb..59763ab
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3563,3568 ****
--- 3563,3582 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+ <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of incremental sort
+ steps. The default is <literal>on</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
<term><varname>enable_indexscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index a18ab43..1eb3f0d
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual,
*** 80,85 ****
--- 80,87 ----
ExplainState *es);
static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 93,99 ****
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 95,101 ----
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 101,106 ****
--- 103,110 ----
static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 1003,1008 ****
--- 1007,1015 ----
case T_Sort:
pname = sname = "Sort";
break;
+ case T_IncrementalSort:
+ pname = sname = "Incremental Sort";
+ break;
case T_Group:
pname = sname = "Group";
break;
*************** ExplainNode(PlanState *planstate, List *
*** 1576,1581 ****
--- 1583,1594 ----
show_sort_keys(castNode(SortState, planstate), ancestors, es);
show_sort_info(castNode(SortState, planstate), es);
break;
+ case T_IncrementalSort:
+ show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ ancestors, es);
+ show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ es);
+ break;
case T_MergeAppend:
show_merge_append_keys(castNode(MergeAppendState, planstate),
ancestors, es);
*************** static void
*** 1901,1915 ****
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
* Likewise, for a MergeAppend node.
*/
static void
--- 1914,1951 ----
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
+ int skipCols;
+
+ if (IsA(plan, IncrementalSort))
+ skipCols = ((IncrementalSort *) plan)->skipCols;
+ else
+ skipCols = 0;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, skipCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es)
+ {
+ IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+ show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ plan->sort.numCols, plan->skipCols,
+ plan->sort.sortColIdx,
+ plan->sort.sortOperators, plan->sort.collations,
+ plan->sort.nullsFirst,
+ ancestors, es);
+ }
+
+ /*
* Likewise, for a MergeAppend node.
*/
static void
*************** show_merge_append_keys(MergeAppendState
*** 1919,1925 ****
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
--- 1955,1961 ----
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, 0, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1943,1949 ****
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
--- 1979,1985 ----
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
*************** show_grouping_set_keys(PlanState *planst
*** 2012,2018 ****
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
--- 2048,2054 ----
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, 0, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 2069,2075 ****
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
--- 2105,2111 ----
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2082,2094 ****
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
--- 2118,2131 ----
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
+ List *resultPresorted = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2128,2136 ****
--- 2165,2177 ----
nullsFirst[keyno]);
/* Emit one property-list item per sort key */
result = lappend(result, pstrdup(sortkeybuf.data));
+ if (keyno < nPresortedKeys)
+ resultPresorted = lappend(resultPresorted, exprstr);
}
ExplainPropertyList(qlabel, result, es);
+ if (nPresortedKeys > 0)
+ ExplainPropertyList("Presorted Key", resultPresorted, es);
}
/*
*************** show_sort_info(SortState *sortstate, Exp
*** 2288,2293 ****
--- 2329,2371 ----
}
/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es)
+ {
+ if (es->analyze && incrsortstate->sort_Done &&
+ incrsortstate->tuplesortstate != NULL)
+ {
+ Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+
+ tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",
+ sortMethod, spaceType, spaceUsed);
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort groups: %ld\n",
+ incrsortstate->groupsCount);
+ }
+ else
+ {
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups",
+ incrsortstate->groupsCount, es);
+ }
+ }
+ }
+
+ /*
* Show information on hash buckets/batches.
*/
static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index 083b20f..b093618
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execExpr.
*** 24,31 ****
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
! nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
--- 24,31 ----
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
! nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index 7e85c66..e7fd9f9
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 30,35 ****
--- 30,36 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 248,253 ****
--- 249,258 ----
ExecReScanSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecReScanIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecReScanGroup((GroupState *) node);
break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 519,526 ****
--- 524,535 ----
case T_CteScan:
case T_Material:
case T_Sort:
+ /* these don't evaluate tlist */
return true;
+ case T_IncrementalSort:
+ return false;
+
case T_LockRows:
case T_Limit:
return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index 486ddf1..2f4a23a
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 93,98 ****
--- 93,99 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 308,313 ****
--- 309,319 ----
estate, eflags);
break;
+ case T_IncrementalSort:
+ result = (PlanState *) ExecInitIncrementalSort(
+ (IncrementalSort *) node, estate, eflags);
+ break;
+
case T_Group:
result = (PlanState *) ExecInitGroup((Group *) node,
estate, eflags);
*************** ExecProcNode(PlanState *node)
*** 531,536 ****
--- 537,546 ----
result = ExecSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ result = ExecIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
result = ExecGroup((GroupState *) node);
break;
*************** ExecEndNode(PlanState *node)
*** 803,808 ****
--- 813,822 ----
ExecEndSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecEndIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecEndGroup((GroupState *) node);
break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index ef35da6..afb5cb2
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 655,660 ****
--- 655,661 ----
sortnode->collations,
sortnode->nullsFirst,
work_mem,
+ false,
false);
}
*************** initialize_aggregate(AggState *aggstate,
*** 733,739 ****
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false);
}
/*
--- 734,740 ----
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false, false);
}
/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...5aa2c62
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,527 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ * Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ * Incremental sort is a specially optimized kind of multikey sort used
+ * when the input is already presorted by a prefix of the required keys
+ * list. Thus, when it's required to sort by (key1, key2 ... keyN) and
+ * the input is already sorted by (key1, key2 ... keyM), M < N, we sort
+ * individual groups where the values of (key1, key2 ... keyM) are equal.
+ *
+ * Consider the following example. We have input tuples consisting of
+ * two integers (x, y), already presorted by x, while it's required to
+ * sort them by x and y. Let the input tuples be the following.
+ *
+ * (1, 5)
+ * (1, 2)
+ * (2, 10)
+ * (2, 1)
+ * (2, 5)
+ * (3, 3)
+ * (3, 7)
+ *
+ * The incremental sort algorithm would sort the following groups, which
+ * have equal x, individually by y:
+ * (1, 5) (1, 2)
+ * (2, 10) (2, 1) (2, 5)
+ * (3, 3) (3, 7)
+ *
+ * After sorting these groups and concatenating them, we would get the
+ * following tuple set, which is sorted by both x and y.
+ *
+ * (1, 2)
+ * (1, 5)
+ * (2, 1)
+ * (2, 5)
+ * (2, 10)
+ * (3, 3)
+ * (3, 7)
+ *
+ * Incremental sort is faster than a full sort on large datasets. But
+ * it provides the biggest benefit for queries with LIMIT, because
+ * incremental sort can return the first tuples without reading the
+ * whole input dataset.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+
+ /*
+ * Check if the first "skipCols" sort-column values are equal.
+ */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ TupleTableSlot *b)
+ {
+ int n, i;
+
+ Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+ n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+ for (i = 0; i < n; i++)
+ {
+ Datum datumA, datumB, result;
+ bool isnullA, isnullB;
+ AttrNumber attno = node->skipKeys[i].attno;
+ SkipKeyData *key;
+
+ datumA = slot_getattr(a, attno, &isnullA);
+ datumB = slot_getattr(b, attno, &isnullB);
+
+ /* Special case for NULL-vs-NULL, else use standard comparison */
+ if (isnullA || isnullB)
+ {
+ if (isnullA == isnullB)
+ continue;
+ else
+ return false;
+ }
+
+ key = &node->skipKeys[i];
+
+ key->fcinfo.arg[0] = datumA;
+ key->fcinfo.arg[1] = datumB;
+
+ /* just for paranoia's sake, we reset isnull each time */
+ key->fcinfo.isnull = false;
+
+ result = FunctionCallInvoke(&key->fcinfo);
+
+ /* Check for null result, since caller is clearly not expecting one */
+ if (key->fcinfo.isnull)
+ elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+ if (!DatumGetBool(result))
+ return false;
+ }
+ return true;
+ }
+
+ /*
+ * Prepare information for skipKeys comparison.
+ */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ int skipCols,
+ i;
+
+ Assert(IsA(plannode, IncrementalSort));
+ skipCols = plannode->skipCols;
+
+ node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+ for (i = 0; i < skipCols; i++)
+ {
+ Oid equalityOp, equalityFunc;
+ SkipKeyData *key;
+
+ key = &node->skipKeys[i];
+ key->attno = plannode->sort.sortColIdx[i];
+
+ equalityOp = get_equality_op_for_ordering_op(
+ plannode->sort.sortOperators[i], NULL);
+ if (!OidIsValid(equalityOp))
+ elog(ERROR, "missing equality operator for ordering operator %u",
+ plannode->sort.sortOperators[i]);
+
+ equalityFunc = get_opcode(equalityOp);
+ if (!OidIsValid(equalityFunc))
+ elog(ERROR, "missing function for operator %u", equalityOp);
+
+ /* Lookup the comparison function */
+ fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+ /* We can initialize the callinfo just once and re-use it */
+ InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ plannode->sort.collations[i], NULL, NULL);
+ key->fcinfo.argnull[0] = false;
+ key->fcinfo.argnull[1] = false;
+ }
+ }
+
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSort
+ *
+ * Assuming that the outer subtree returns tuples presorted by some
+ * prefix of the target sort columns, performs an incremental sort.
+ * It fetches groups of tuples where the prefix sort columns are equal
+ * and sorts them using tuplesort. This approach avoids sorting the
+ * whole dataset at once. Besides taking less memory and being faster,
+ * it allows us to start returning tuples before fetching the full
+ * dataset from the outer subtree.
+ *
+ * Conditions:
+ * -- none.
+ *
+ * Initial States:
+ * -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+ TupleTableSlot *
+ ExecIncrementalSort(IncrementalSortState *node)
+ {
+ EState *estate;
+ ScanDirection dir;
+ Tuplesortstate *tuplesortstate;
+ TupleTableSlot *slot;
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ PlanState *outerNode;
+ int skipCols;
+ TupleDesc tupDesc;
+ int64 nTuples = 0;
+
+ skipCols = plannode->skipCols;
+
+ /*
+ * get state info from node
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "entering routine");
+
+ estate = node->ss.ps.state;
+ dir = estate->es_direction;
+ tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+ /*
+ * Return next tuple from sorted set if any.
+ */
+ if (node->sort_Done)
+ {
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ if (tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ slot, NULL) || node->finished)
+ return slot;
+ }
+
+ /*
+ * If first time through, read all tuples from outer plan and pass them to
+ * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
+ */
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "sorting subplan");
+
+ /*
+ * Want to scan subplan in the forward direction while creating the
+ * sorted data.
+ */
+ estate->es_direction = ForwardScanDirection;
+
+ /*
+ * Initialize tuplesort module.
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "calling tuplesort_begin");
+
+ outerNode = outerPlanState(node);
+ tupDesc = ExecGetResultType(outerNode);
+
+ if (node->tuplesortstate == NULL)
+ {
+ /*
+ * We are going to process the first group of presorted data.
+ * Initialize support structures for cmpSortSkipCols - already
+ * sorted columns.
+ */
+ prepareSkipCols(node);
+
+ /*
+ * Only pass on the remaining columns that are unsorted. Skip
+ * abbreviated keys for incremental sort: we are unlikely to
+ * have huge groups, so using abbreviated keys would likely be
+ * a waste of time.
+ */
+ tuplesortstate = tuplesort_begin_heap(
+ tupDesc,
+ plannode->sort.numCols - skipCols,
+ &(plannode->sort.sortColIdx[skipCols]),
+ &(plannode->sort.sortOperators[skipCols]),
+ &(plannode->sort.collations[skipCols]),
+ &(plannode->sort.nullsFirst[skipCols]),
+ work_mem,
+ false,
+ true);
+ node->tuplesortstate = (void *) tuplesortstate;
+ node->groupsCount++;
+ }
+ else
+ {
+ /* Next group of presorted data */
+ tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ node->groupsCount++;
+ }
+
+ /* Calculate remaining bound for bounded sort */
+ if (node->bounded)
+ tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+ /*
+ * Put the next group of tuples, in which the skipCols sort values
+ * are equal, into the tuplesort.
+ */
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ /* Put next group of presorted data to the tuplesort */
+ if (TupIsNull(node->prevSlot))
+ {
+ /* First tuple */
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+ else
+ {
+ ExecCopySlot(node->prevSlot, slot);
+ }
+ }
+ else
+ {
+ /* Put previous tuple into tuplesort */
+ tuplesort_puttupleslot(tuplesortstate, node->prevSlot);
+ nTuples++;
+
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+ else
+ {
+ bool cmp;
+ cmp = cmpSortSkipCols(node, node->prevSlot, slot);
+
+ /* Replace previous tuple with current one */
+ ExecCopySlot(node->prevSlot, slot);
+
+ /*
+ * When skipCols are not equal then group of presorted data
+ * is finished
+ */
+ if (!cmp)
+ break;
+ }
+ }
+ }
+
+ /*
+ * Complete the sort.
+ */
+ tuplesort_performsort(tuplesortstate);
+
+ /*
+ * restore to user specified direction
+ */
+ estate->es_direction = dir;
+
+ /*
+ * finally set the sorted flag to true
+ */
+ node->sort_Done = true;
+ node->bounded_Done = node->bounded;
+
+ /*
+ * Adjust bound_Done with number of tuples we've actually sorted.
+ */
+ if (node->bounded)
+ {
+ if (node->finished)
+ node->bound_Done = node->bound;
+ else
+ node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ }
+
+ SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "retrieving tuple from tuplesort");
+
+ /*
+ * Get the first or next tuple from tuplesort. Returns NULL if no more
+ * tuples.
+ */
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ slot, NULL);
+ return slot;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecInitIncrementalSort
+ *
+ * Creates the run-time state information for the sort node
+ * produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ IncrementalSortState *incrsortstate;
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "initializing sort node");
+
+ /*
+ * Incremental sort can't be used with EXEC_FLAG_REWIND,
+ * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only hold the
+ * current group in the tuplesortstate.
+ */
+ Assert((eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) == 0);
+
+ /*
+ * create state structure
+ */
+ incrsortstate = makeNode(IncrementalSortState);
+ incrsortstate->ss.ps.plan = (Plan *) node;
+ incrsortstate->ss.ps.state = estate;
+
+ incrsortstate->bounded = false;
+ incrsortstate->sort_Done = false;
+ incrsortstate->finished = false;
+ incrsortstate->tuplesortstate = NULL;
+ incrsortstate->prevSlot = NULL;
+ incrsortstate->bound_Done = 0;
+ incrsortstate->groupsCount = 0;
+ incrsortstate->skipKeys = NULL;
+
+ /*
+ * Miscellaneous initialization
+ *
+ * Sort nodes don't initialize their ExprContexts because they never call
+ * ExecQual or ExecProject.
+ */
+
+ /*
+ * tuple table initialization
+ *
+ * sort nodes only return scan tuples from their sorted relation.
+ */
+ ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+
+ /*
+ * initialize child nodes
+ *
+ * We shield the child node from the need to support REWIND, BACKWARD, or
+ * MARK/RESTORE.
+ */
+ eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+ outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+ /*
+ * initialize tuple type. no need to initialize projection info because
+ * this node doesn't do projections.
+ */
+ ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+ /* make standalone slot to store previous tuple from outer node */
+ incrsortstate->prevSlot = MakeSingleTupleTableSlot(
+ ExecGetResultType(outerPlanState(incrsortstate)));
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "sort node initialized");
+
+ return incrsortstate;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "shutting down sort node");
+
+ /*
+ * clean out the tuple table
+ */
+ ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ /* must drop standalone tuple slot holding the previous tuple from outer node */
+ ExecDropSingleTupleTableSlot(node->prevSlot);
+
+ /*
+ * Release tuplesort resources
+ */
+ if (node->tuplesortstate != NULL)
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+
+ /*
+ * shut down the subplan
+ */
+ ExecEndNode(outerPlanState(node));
+
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "sort node shutdown");
+ }
+
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ PlanState *outerPlan = outerPlanState(node);
+
+ /*
+ * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ * re-scan it at all.
+ */
+ if (!node->sort_Done)
+ return;
+
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+ /*
+ * If subnode is to be rescanned then we forget previous sort results; we
+ * have to re-read the subplan and re-sort. Also must re-sort if the
+ * bounded-sort parameters changed or we didn't select randomAccess.
+ *
+ * Otherwise we can just rewind and rescan the sorted output.
+ */
+ node->sort_Done = false;
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+ node->bound_Done = 0;
+
+ /*
+ * if chgParam of subnode is not null then plan will be re-scanned by
+ * first ExecProcNode.
+ */
+ if (outerPlan->chgParam == NULL)
+ ExecReScan(outerPlan);
+ }
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 591a31a..cf228d6
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(SortState *node)
*** 89,95 ****
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
--- 89,96 ----
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess,
! false);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index 61bc502..0d6f628
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 910,915 ****
--- 910,933 ----
/*
+ * CopySortFields
+ *
+ * This function copies the fields of the Sort node. It is used by
+ * all the copy functions for classes which inherit from Sort.
+ */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+ COPY_SCALAR_FIELD(numCols);
+ COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+
+ /*
* _copySort
*/
static Sort *
*************** _copySort(const Sort *from)
*** 920,932 ****
/*
* copy node superclass fields
*/
! CopyPlanFields((const Plan *) from, (Plan *) newnode);
! COPY_SCALAR_FIELD(numCols);
! COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
return newnode;
}
--- 938,966 ----
/*
* copy node superclass fields
*/
! CopySortFields(from, newnode);
! return newnode;
! }
!
!
! /*
! * _copyIncrementalSort
! */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! IncrementalSort *newnode = makeNode(IncrementalSort);
!
! /*
! * copy node superclass fields
! */
! CopySortFields((const Sort *) from, (Sort *) newnode);
!
! /*
! * copy remainder of node
! */
! COPY_SCALAR_FIELD(skipCols);
return newnode;
}
*************** copyObjectImpl(const void *from)
*** 4758,4763 ****
--- 4792,4800 ----
case T_Sort:
retval = _copySort(from);
break;
+ case T_IncrementalSort:
+ retval = _copyIncrementalSort(from);
+ break;
case T_Group:
retval = _copyGroup(from);
break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index 766ca49..c371afc
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 836,847 ****
}
static void
! _outSort(StringInfo str, const Sort *node)
{
int i;
- WRITE_NODE_TYPE("SORT");
-
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
--- 836,845 ----
}
static void
! _outSortInfo(StringInfo str, const Sort *node)
{
int i;
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 864,869 ****
--- 862,885 ----
}
static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ WRITE_NODE_TYPE("SORT");
+
+ _outSortInfo(str, node);
+ }
+
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ WRITE_NODE_TYPE("INCREMENTALSORT");
+
+ _outSortInfo(str, (const Sort *) node);
+
+ WRITE_INT_FIELD(skipCols);
+ }
+
+ static void
_outUnique(StringInfo str, const Unique *node)
{
int i;
*************** outNode(StringInfo str, const void *obj)
*** 3677,3682 ****
--- 3693,3701 ----
case T_Sort:
_outSort(str, obj);
break;
+ case T_IncrementalSort:
+ _outIncrementalSort(str, obj);
+ break;
case T_Unique:
_outUnique(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index 766f2d8..5e487d4
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 2032,2043 ****
}
/*
! * _readSort
*/
! static Sort *
! _readSort(void)
{
! READ_LOCALS(Sort);
ReadCommonPlan(&local_node->plan);
--- 2032,2044 ----
}
/*
! * ReadCommonSort
! * Assign the basic stuff of all nodes that inherit from Sort
*/
! static void
! ReadCommonSort(Sort *local_node)
{
! READ_TEMP_LOCALS();
ReadCommonPlan(&local_node->plan);
*************** _readSort(void)
*** 2046,2051 ****
--- 2047,2078 ----
READ_OID_ARRAY(sortOperators, local_node->numCols);
READ_OID_ARRAY(collations, local_node->numCols);
READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+
+ /*
+ * _readSort
+ */
+ static Sort *
+ _readSort(void)
+ {
+ READ_LOCALS_NO_FIELDS(Sort);
+
+ ReadCommonSort(local_node);
+
+ READ_DONE();
+ }
+
+ /*
+ * _readIncrementalSort
+ */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ READ_LOCALS(IncrementalSort);
+
+ ReadCommonSort(&local_node->sort);
+
+ READ_INT_FIELD(skipCols);
READ_DONE();
}
*************** parseNodeString(void)
*** 2598,2603 ****
--- 2625,2632 ----
return_value = _readMaterial();
else if (MATCH("SORT", 4))
return_value = _readSort();
+ else if (MATCH("INCREMENTALSORT", 15))
+ return_value = _readIncrementalSort();
else if (MATCH("GROUP", 5))
return_value = _readGroup();
else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index 343b35a..2191634
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3279,3284 ****
--- 3279,3288 ----
ptype = "Sort";
subpath = ((SortPath *) path)->subpath;
break;
+ case T_IncrementalSortPath:
+ ptype = "IncrementalSort";
+ subpath = ((SortPath *) path)->subpath;
+ break;
case T_GroupPath:
ptype = "Group";
subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index ed07e2f..eb17370
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
+ bool enable_incrementalsort = true;
bool enable_hashagg = true;
bool enable_nestloop = true;
bool enable_material = true;
*************** cost_recursive_union(Path *runion, Path
*** 1600,1605 ****
--- 1601,1613 ----
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
+ * The sort can be either a full sort of the relation or an incremental sort
+ * when the input data is already presorted by some of the required pathkeys.
+ * In the latter case we estimate the number of groups the input data is
+ * divided into by the presorted pathkeys, then estimate the cost of sorting
+ * each individual group, assuming the data is divided among the groups
+ * uniformly. Also, if a LIMIT is specified, then we only have to pull from
+ * the input and sort some of the groups, not all of them.
+ *
* If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path
*** 1626,1632 ****
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
--- 1634,1641 ----
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_startup_cost' is the startup cost for reading the input data
! * 'input_total_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path
*** 1642,1660 ****
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_cost;
! Cost run_cost = 0;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
path->rows = tuples;
--- 1651,1678 ----
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_startup_cost;
! Cost run_cost = 0,
! rest_cost,
! group_cost,
! input_run_cost = input_total_cost - input_startup_cost;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
+ double num_groups,
+ group_input_bytes,
+ group_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
+ if (!enable_incrementalsort)
+ presorted_keys = 0;
path->rows = tuples;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1680,1692 ****
output_bytes = input_bytes;
}
! if (output_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(input_bytes / BLCKSZ);
! double nruns = input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
--- 1698,1747 ----
output_bytes = input_bytes;
}
! /*
! * Estimate the number of groups the dataset is divided into by the
! * presorted keys.
! */
! if (presorted_keys > 0)
! {
! List *presortedExprs = NIL;
! ListCell *l;
! int i = 0;
!
! /* Extract presorted keys as list of expressions */
! foreach(l, pathkeys)
! {
! PathKey *key = (PathKey *)lfirst(l);
! EquivalenceMember *member = (EquivalenceMember *)
! lfirst(list_head(key->pk_eclass->ec_members));
!
! presortedExprs = lappend(presortedExprs, member->em_expr);
!
! i++;
! if (i >= presorted_keys)
! break;
! }
!
! /* Estimate number of groups with equal presorted keys */
! num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! }
! else
! {
! num_groups = 1.0;
! }
!
! /*
! * Estimate the average cost of sorting one group in which all the
! * presorted keys are equal.
! */
! group_input_bytes = input_bytes / num_groups;
! group_tuples = tuples / num_groups;
! if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(group_input_bytes / BLCKSZ);
! double nruns = group_input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1696,1702 ****
*
* Assume about N log2 N comparisons
*/
! startup_cost += comparison_cost * tuples * LOG2(tuples);
/* Disk costs */
--- 1751,1757 ----
*
* Assume about N log2 N comparisons
*/
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
/* Disk costs */
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1707,1716 ****
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! startup_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1762,1771 ----
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! group_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1718,1731 ****
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
}
else
{
! /* We'll use plain quicksort on all the input tuples */
! startup_cost += comparison_cost * tuples * LOG2(tuples);
}
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
--- 1773,1805 ----
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
}
else
{
! /*
! * We'll use plain quicksort on all the input tuples. If we expect
! * fewer than two tuples per sort group, assume the logarithmic part
! * of the estimate to be 1.
! */
! if (group_tuples >= 2.0)
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
! else
! group_cost = comparison_cost * group_tuples;
}
+ /* Add per group cost of fetching tuples from input */
+ group_cost += input_run_cost / num_groups;
+
+ /*
+ * We have to sort the first group before the node can start returning
+ * tuples. Sorting the remaining groups is required to return all the
+ * other tuples.
+ */
+ startup_cost += group_cost;
+ rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ if (rest_cost > 0.0)
+ run_cost += rest_cost;
+
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1736,1741 ****
--- 1810,1828 ----
*/
run_cost += cpu_operator_cost * tuples;
+ /* Extra costs of incremental sort */
+ if (presorted_keys > 0)
+ {
+ /*
+ * In the incremental sort case we also have to account for the cost of
+ * detecting sort group boundaries. This requires an extra copy and
+ * comparison for each tuple.
+ */
+ run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+ /* Cost of per group tuplesort reset */
+ run_cost += 10.0 * cpu_tuple_cost * num_groups;
+ }
+
path->startup_cost = startup_cost;
path->total_cost = startup_cost + run_cost;
}
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2489,2494 ****
--- 2576,2583 ----
cost_sort(&sort_path,
root,
outersortkeys,
+ pathkeys_common(outer_path->pathkeys, outersortkeys),
+ outer_path->startup_cost,
outer_path->total_cost,
outer_path_rows,
outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2515,2520 ****
--- 2604,2611 ----
cost_sort(&sort_path,
root,
innersortkeys,
+ pathkeys_common(inner_path->pathkeys, innersortkeys),
+ inner_path->startup_cost,
inner_path->total_cost,
inner_path_rows,
inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 2c26906..2da6f40
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
#include "nodes/nodeFuncs.h"
#include "nodes/plannodes.h"
#include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/tlist.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
return PATHKEYS_EQUAL;
}
+
+ /*
+ * pathkeys_common
+ * Returns length of longest common prefix of keys1 and keys2.
+ */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ int n;
+ ListCell *key1,
+ *key2;
+ n = 0;
+
+ forboth(key1, keys1, key2, keys2)
+ {
+ PathKey *pathkey1 = (PathKey *) lfirst(key1);
+ PathKey *pathkey2 = (PathKey *) lfirst(key2);
+
+ if (pathkey1 != pathkey2)
+ return n;
+ n++;
+ }
+
+ return n;
+ }
+
+
/*
* pathkeys_contained_in
* Common special case of compare_pathkeys: we just want to know
*************** get_cheapest_path_for_pathkeys(List *pat
*** 373,380 ****
/*
* get_cheapest_fractional_path_for_pathkeys
* Find the cheapest path (for retrieving a specified fraction of all
! * the tuples) that satisfies the given pathkeys and parameterization.
! * Return NULL if no such path.
*
* See compare_fractional_path_costs() for the interpretation of the fraction
* parameter.
--- 402,413 ----
/*
* get_cheapest_fractional_path_for_pathkeys
* Find the cheapest path (for retrieving a specified fraction of all
! * the tuples) that satisfies the given parameterization and at least
! * partially satisfies the given pathkeys. Return NULL if no such path.
! * If the pathkeys are satisfied only partially, then we would have to do
! * an incremental sort in order to satisfy them completely. Since
! * incremental sort consumes data by presorted groups, we would have to
! * consume more input than with a fully presorted path.
*
* See compare_fractional_path_costs() for the interpretation of the fraction
* parameter.
*************** right_merge_direction(PlannerInfo *root,
*** 1488,1513 ****
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
! * no good to order by just the first key(s) of the requested ordering.
! * So the result is always either 0 or list_length(root->query_pathkeys).
*/
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
{
! if (root->query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
{
! /* It's useful ... or at least the first N keys are */
! return list_length(root->query_pathkeys);
}
-
- return 0; /* path ordering not useful */
}
/*
--- 1521,1562 ----
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Returns the number of pathkeys that match the given argument. The
! * others can be satisfied by an incremental sort.
*/
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
{
! int n_common_pathkeys;
!
! if (query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
!
! if (enable_incrementalsort)
{
! /*
! * Return the number of pathkeys in common, or 0 if there are none.
! * Any leading common pathkeys are useful for ordering because the
! * remaining ones can be handled by an incremental sort.
! */
! return n_common_pathkeys;
! }
! else
! {
! /*
! * When incremental sort is disabled, pathkeys are useful only when
! * they contain all the query pathkeys.
! */
! if (n_common_pathkeys == list_length(query_pathkeys))
! return n_common_pathkeys;
! else
! return 0;
}
}
/*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1523,1529 ****
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
--- 1572,1578 ----
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 2a78595..bbe776f
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 236,242 ****
bool *mergenullsfirst,
Plan *lefttree, Plan *righttree,
JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 236,242 ----
bool *mergenullsfirst,
Plan *lefttree, Plan *righttree,
JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static Plan *prepare_sort_from_pathkeys(
*** 251,260 ****
static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 251,262 ----
static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! int skipCols);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 436,441 ****
--- 438,444 ----
(GatherPath *) best_path);
break;
case T_Sort:
+ case T_IncrementalSort:
plan = (Plan *) create_sort_plan(root,
(SortPath *) best_path,
flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1099,1104 ****
--- 1102,1108 ----
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int n_common_pathkeys;
/* Build the child plan */
/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1133,1141 ****
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
--- 1137,1147 ----
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
+ n_common_pathkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1483,1488 ****
--- 1489,1495 ----
Plan *subplan;
List *pathkeys = best_path->path.pathkeys;
List *tlist = build_path_tlist(root, &best_path->path);
+ int n_common_pathkeys;
/* As with Gather, it's best to project away columns in the workers. */
subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1509,1520 ****
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
--- 1516,1531 ----
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
! {
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ n_common_pathkeys,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
+ }
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1622,1627 ****
--- 1633,1639 ----
{
Sort *plan;
Plan *subplan;
+ int n_common_pathkeys;
/*
* We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1631,1637 ****
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
--- 1643,1653 ----
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! best_path->subpath->pathkeys);
!
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! n_common_pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1875,1881 ****
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
! subplan);
}
if (!rollup->is_hashed)
--- 1891,1898 ----
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
! subplan,
! 0);
}
if (!rollup->is_hashed)
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3806,3813 ****
*/
if (best_path->outersortkeys)
{
! Sort *sort = make_sort_from_pathkeys(outer_plan,
! best_path->outersortkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
--- 3823,3836 ----
*/
if (best_path->outersortkeys)
{
! Sort *sort;
! int n_common_pathkeys;
!
! n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! best_path->jpath.outerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3818,3825 ****
if (best_path->innersortkeys)
{
! Sort *sort = make_sort_from_pathkeys(inner_plan,
! best_path->innersortkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
--- 3841,3854 ----
if (best_path->innersortkeys)
{
! Sort *sort;
! int n_common_pathkeys;
!
! n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! best_path->jpath.innerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4871,4877 ****
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
--- 4900,4907 ----
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL, 0,
! lefttree->startup_cost,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5451,5463 ****
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node = makeNode(Sort);
! Plan *plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
--- 5481,5511 ----
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node;
! Plan *plan;
+ /* Always use regular sort node when enable_incrementalsort = false */
+ if (!enable_incrementalsort)
+ skipCols = 0;
+
+ if (skipCols == 0)
+ {
+ node = makeNode(Sort);
+ }
+ else
+ {
+ IncrementalSort *incrementalSort;
+
+ incrementalSort = makeNode(IncrementalSort);
+ node = &incrementalSort->sort;
+ incrementalSort->skipCols = skipCols;
+ }
+
+ plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass
*** 5789,5795 ****
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys)
{
int numsortkeys;
AttrNumber *sortColIdx;
--- 5837,5843 ----
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, int skipCols)
{
int numsortkeys;
AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree,
*** 5809,5815 ****
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5857,5863 ----
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5852,5858 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5900,5906 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, 0,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5873,5879 ****
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
--- 5921,5928 ----
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5906,5912 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5955,5961 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** is_projection_capable_plan(Plan *plan)
*** 6555,6560 ****
--- 6604,6610 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index 5565736..eaf7a78
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
#include "parser/parse_clause.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
#include "utils/syscache.h"
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index f99257b..09338c7
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3751,3764 ****
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_partial_path || is_sorted)
{
/* Sort the cheapest partial path, if it isn't already */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3751,3764 ----
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_partial_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest partial path, if it isn't already */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3831,3844 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_path || is_sorted)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3831,3844 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4905,4917 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || is_sorted)
{
! if (!is_sorted)
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
--- 4905,4917 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || n_useful_pathkeys > 0)
{
! if (n_useful_pathkeys < list_length(root->sort_pathkeys))
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid
*** 6040,6047 ****
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL,
! seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
--- 6040,6048 ----
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL, 0,
! seqScanPath->startup_cost, seqScanPath->total_cost,
! rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index cdb8e95..420d752
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 634,639 ****
--- 634,640 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 87cc44d..25fac59
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2702,2707 ****
--- 2702,2708 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_Gather:
case T_GatherMerge:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index e327e66..b2b8440
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 963,969 ****
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 963,970 ----
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, 0,
! sorted_p.startup_cost, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 8536212..a99a1a7
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
}
/*
! * compare_path_fractional_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
--- 95,101 ----
}
/*
! * compare_fractional_path_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1297,1308 ****
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1297,1309 ----
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ int n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1316,1321 ****
--- 1317,1324 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->parent->tuples,
subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1552,1558 ****
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
--- 1555,1562 ----
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL, 0,
! subpath->startup_cost,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
*************** create_gather_merge_path(PlannerInfo *ro
*** 1644,1649 ****
--- 1648,1654 ----
GatherMergePath *pathnode = makeNode(GatherMergePath);
Cost input_startup_cost = 0;
Cost input_total_cost = 0;
+ int n_common_pathkeys;
Assert(subpath->parallel_safe);
Assert(pathkeys);
*************** create_gather_merge_path(PlannerInfo *ro
*** 1660,1666 ****
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1665,1673 ----
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
!
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_gather_merge_path(PlannerInfo *ro
*** 1674,1679 ****
--- 1681,1688 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2516,2524 ****
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode = makeNode(SortPath);
- pathnode->path.pathtype = T_Sort;
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
--- 2525,2555 ----
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode;
! int n_common_pathkeys;
!
! if (enable_incrementalsort)
! n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! else
! n_common_pathkeys = 0;
!
! if (n_common_pathkeys == 0)
! {
! pathnode = makeNode(SortPath);
! pathnode->path.pathtype = T_Sort;
! }
! else
! {
! IncrementalSortPath *incpathnode;
!
! incpathnode = makeNode(IncrementalSortPath);
! pathnode = &incpathnode->spath;
! pathnode->path.pathtype = T_IncrementalSort;
! incpathnode->skipCols = n_common_pathkeys;
! }
!
! Assert(n_common_pathkeys < list_length(pathkeys));
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2532,2538 ****
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root, pathkeys,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
--- 2563,2571 ----
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root,
! pathkeys, n_common_pathkeys,
! subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2840,2846 ****
else
{
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL,
0.0,
subpath->rows,
subpath->pathtarget->width,
--- 2873,2880 ----
else
{
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL, 0,
! 0.0,
0.0,
subpath->rows,
subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index e462fbd..fb54f27
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 277,283 ****
qstate->sortOperators,
qstate->sortCollations,
qstate->sortNullsFirsts,
! work_mem, false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
--- 277,283 ----
qstate->sortOperators,
qstate->sortCollations,
qstate->sortNullsFirsts,
! work_mem, false, false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index 5c382a2..6426e44
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3567,3572 ****
--- 3567,3608 ----
}
/*
+ * estimate_pathkeys_groups - Estimate the number of groups the dataset is
+ * divided into by the given pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * the first i pathkeys divide the dataset into. Effectively a convenience
+ * wrapper over estimate_num_groups().
+ */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ ListCell *l;
+ List *groupExprs = NIL;
+ double *result;
+ int i;
+
+ /*
+ * Get number of groups for each prefix of pathkeys.
+ */
+ i = 0;
+ result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ groupExprs = lappend(groupExprs, member->em_expr);
+
+ result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ i++;
+ }
+
+ return result;
+ }
+
+ /*
* Estimate hash bucketsize fraction (ie, number of entries in a bucket
* divided by total tuples in relation) if the specified expression is used
* as a hash key.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 8b5f064..780d3b7
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 859,864 ****
--- 859,873 ----
NULL, NULL, NULL
},
{
+ {"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of incremental sort steps."),
+ NULL
+ },
+ &enable_incrementalsort,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of hashed aggregation plans."),
NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index e1e692d..ed189c2
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 281,286 ****
--- 281,293 ----
int64 allowedMem; /* total memory allowed, in bytes */
int maxTapes; /* number of tapes (Knuth's T) */
int tapeRange; /* maxTapes-1 (Knuth's P) */
+ int64 maxSpace; /* maximum amount of space occupied among sort
+ of groups, either in-memory or on-disk */
+ bool maxSpaceOnDisk; /* true when maxSpace is a value for on-disk
+ space, false when it's a value for in-memory
+ space */
+ TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ MemoryContext maincontext;
MemoryContext sortcontext; /* memory context holding most sort data */
MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 633,638 ****
--- 640,648 ----
static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+
/*
* Special versions of qsort just for SortTuple objects. qsort_tuple() sorts
*************** static Tuplesortstate *
*** 667,685 ****
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Create a working memory context for this sort operation. All data
! * needed by the sort will live inside this context.
*/
! sortcontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
--- 677,706 ----
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
+ MemoryContext maincontext;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Memory context surviving tuplesort_reset. This memory context holds
! * data which is useful to keep while sorting multiple similar batches.
*/
! maincontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
+ * Create a working memory context for one sort operation. The content of
+ * this context is deleted by tuplesort_reset.
+ */
+ sortcontext = AllocSetContextCreate(maincontext,
+ "TupleSort sort",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+
+ /*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 696,702 ****
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(sortcontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
--- 717,723 ----
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(maincontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
*************** tuplesort_begin_common(int workMem, bool
*** 714,719 ****
--- 735,741 ----
state->availMem = state->allowedMem;
state->sortcontext = sortcontext;
state->tuplecontext = tuplecontext;
+ state->maincontext = maincontext;
state->tapeset = NULL;
state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 754,766 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
AssertArg(nkeys > 0);
--- 776,789 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
AssertArg(nkeys > 0);
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 802,808 ****
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0);
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
--- 825,831 ----
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0) && !skipAbbrev;
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 833,839 ****
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 856,862 ----
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 924,930 ****
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 947,953 ----
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 997,1003 ****
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 1020,1026 ----
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 1034,1040 ****
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 1057,1063 ----
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1145,1160 ****
}
/*
! * tuplesort_end
! *
! * Release resources and clean up.
*
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
*/
! void
! tuplesort_end(Tuplesortstate *state)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1168,1179 ----
}
/*
! * tuplesort_free
*
! * Internal routine for freeing resources of tuplesort.
*/
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1213,1219 ****
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! MemoryContextDelete(state->sortcontext);
}
/*
--- 1232,1329 ----
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! if (delete)
! {
! MemoryContextDelete(state->maincontext);
! }
! else
! {
! MemoryContextResetOnly(state->sortcontext);
! MemoryContextResetOnly(state->tuplecontext);
! }
! }
!
! /*
! * tuplesort_end
! *
! * Release resources and clean up.
! *
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
! */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! tuplesort_free(state, true);
! }
!
! /*
! * tuplesort_updatemax
! *
! * Update maximum resource usage statistics.
! */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! int64 spaceUsed;
! bool spaceUsedOnDisk;
!
! /*
! * Note: it might seem we should provide both memory and disk usage for a
! * disk-based sort. However, the current code doesn't track memory space
! * accurately once we have begun to return tuples to the caller (since we
! * don't account for pfree's the caller is expected to do), so we cannot
! * rely on availMem in a disk sort. This does not seem worth the overhead
! * to fix. Is it worth creating an API for the memory context code to
! * tell us how much is actually used in sortcontext?
! */
! if (state->tapeset)
! {
! spaceUsedOnDisk = true;
! spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! }
! else
! {
! spaceUsedOnDisk = false;
! spaceUsed = state->allowedMem - state->availMem;
! }
!
! if (spaceUsed > state->maxSpace)
! {
! state->maxSpace = spaceUsed;
! state->maxSpaceOnDisk = spaceUsedOnDisk;
! state->maxSpaceStatus = state->status;
! }
! }
!
! /*
! * tuplesort_reset
! *
! * Reset the tuplesort. Reset all the data in the tuplesort, but leave the
! * meta-information in place. After tuplesort_reset, the tuplesort is ready
! * to start a new sort. This avoids recreating the tuplesort (and saves
! * resources) when sorting multiple small batches.
! */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! tuplesort_updatemax(state);
! tuplesort_free(state, false);
! state->status = TSS_INITIAL;
! state->memtupcount = 0;
! state->boundUsed = false;
! state->tapeset = NULL;
! state->currentRun = 0;
! state->result_tape = -1;
! state->bounded = false;
! state->availMem = state->allowedMem;
! state->lastReturnedTuple = NULL;
! state->slabAllocatorUsed = false;
! state->slabMemoryBegin = NULL;
! state->slabMemoryEnd = NULL;
! state->slabFreeHead = NULL;
! USEMEM(state, GetMemoryChunkSpace(state->memtuples));
}
/*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 3219,3245 ****
const char **spaceType,
long *spaceUsed)
{
! /*
! * Note: it might seem we should provide both memory and disk usage for a
! * disk-based sort. However, the current code doesn't track memory space
! * accurately once we have begun to return tuples to the caller (since we
! * don't account for pfree's the caller is expected to do), so we cannot
! * rely on availMem in a disk sort. This does not seem worth the overhead
! * to fix. Is it worth creating an API for the memory context code to
! * tell us how much is actually used in sortcontext?
! */
! if (state->tapeset)
! {
*spaceType = "Disk";
- *spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- }
else
- {
*spaceType = "Memory";
! *spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! }
! switch (state->status)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
--- 3329,3343 ----
const char **spaceType,
long *spaceUsed)
{
! tuplesort_updatemax(state);
!
! if (state->maxSpaceOnDisk)
*spaceType = "Disk";
else
*spaceType = "Memory";
! *spaceUsed = (state->maxSpace + 1023) / 1024;
! switch (state->maxSpaceStatus)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...09c5a27
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,25 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+
+ #include "nodes/execnodes.h"
+
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node,
+ EState *estate, int eflags);
+ extern TupleTableSlot *ExecIncrementalSort(IncrementalSortState *node);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+ #endif /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index fa99244..0e59187
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1676,1681 ****
--- 1676,1695 ----
Tuplestorestate *tuplestorestate;
} MaterialState;
+
+ /* ----------------
+ * When sorting by multiple keys, the input dataset may already be presorted
+ * by some prefix of those keys. We call these "skip keys". SkipKeyData
+ * represents information about one such key.
+ * ----------------
+ */
+ typedef struct SkipKeyData
+ {
+ FmgrInfo flinfo; /* comparison function info */
+ FunctionCallInfoData fcinfo; /* comparison function call info */
+ OffsetNumber attno; /* attribute number in tuple */
+ } SkipKeyData;
+
/* ----------------
* SortState information
* ----------------
*************** typedef struct SortState
*** 1692,1697 ****
--- 1706,1731 ----
void *tuplesortstate; /* private state of tuplesort.c */
} SortState;
+ /* ----------------
+ * IncrementalSortState information
+ * ----------------
+ */
+ typedef struct IncrementalSortState
+ {
+ ScanState ss; /* its first field is NodeTag */
+ bool bounded; /* is the result set bounded? */
+ int64 bound; /* if bounded, how many tuples are needed */
+ bool sort_Done; /* sort completed yet? */
+ bool finished; /* fetching tuples from outer node
is finished? */
+ bool bounded_Done; /* value of bounded we did the sort with */
+ int64 bound_Done; /* value of bound we did the sort with */
+ void *tuplesortstate; /* private state of tuplesort.c */
+ SkipKeyData *skipKeys; /* keys the dataset is presorted by */
+ int64 groupsCount; /* number of groups with equal skip keys */
+ TupleTableSlot *prevSlot; /* slot for previous tuple from outer node */
+ } IncrementalSortState;
+
/* ---------------------
* GroupState information
* ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index 177853b..cf64e29
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 73,78 ****
--- 73,79 ----
T_HashJoin,
T_Material,
T_Sort,
+ T_IncrementalSort,
T_Group,
T_Agg,
T_WindowAgg,
*************** typedef enum NodeTag
*** 125,130 ****
--- 126,132 ----
T_HashJoinState,
T_MaterialState,
T_SortState,
+ T_IncrementalSortState,
T_GroupState,
T_AggState,
T_WindowAggState,
*************** typedef enum NodeTag
*** 239,244 ****
--- 241,247 ----
T_ProjectionPath,
T_ProjectSetPath,
T_SortPath,
+ T_IncrementalSortPath,
T_GroupPath,
T_UpperUniquePath,
T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index a2dd26f..05e4f82
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 730,735 ****
--- 730,746 ----
bool *nullsFirst; /* NULLS FIRST/LAST directions */
} Sort;
+
+ /* ----------------
+ * incremental sort node
+ * ----------------
+ */
+ typedef struct IncrementalSort
+ {
+ Sort sort;
+ int skipCols; /* number of presorted columns */
+ } IncrementalSort;
+
/* ---------------
* group node -
* Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index ebf9480..dd0478d
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1372,1377 ****
--- 1372,1387 ----
} SortPath;
/*
+ * IncrementalSortPath
+ */
+ typedef struct IncrementalSortPath
+ {
+ SortPath spath;
+ int skipCols;
+ } IncrementalSortPath;
+
+
+ /*
* GroupPath represents grouping (of presorted input)
*
* groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 6909359..86dcdbb
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
extern bool enable_bitmapscan;
extern bool enable_tidscan;
extern bool enable_sort;
+ extern bool enable_incrementalsort;
extern bool enable_hashagg;
extern bool enable_nestloop;
extern bool enable_material;
*************** extern void cost_namedtuplestorescan(Pat
*** 102,109 ****
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
--- 103,111 ----
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 25fe78c..01073dd
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 182,187 ****
--- 182,188 ----
extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
Relids required_outer,
CostSelector cost_criterion,
*************** extern List *select_outer_pathkeys_for_m
*** 220,225 ****
--- 221,227 ----
extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
List *mergeclauses,
List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
extern List *truncate_useless_pathkeys(PlannerInfo *root,
RelOptInfo *rel,
List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 9f9d2dc..b8884b6
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 204,209 ****
--- 204,212 ----
extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows, List **pgset);
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ double tuples);
+
extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
double nbuckets);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 5b3f475..616f9f5
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 62,68 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
--- 62,69 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 104,109 ****
--- 105,112 ----
extern void tuplesort_end(Tuplesortstate *state);
+ extern void tuplesort_reset(Tuplesortstate *state);
+
extern void tuplesort_get_stats(Tuplesortstate *state,
const char **sortMethod,
const char **spaceType,
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort
*** 19,27 ****
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Sort
Sort Key: id, data
! -> Seq Scan on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
--- 19,28 ----
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Incremental Sort
Sort Key: id, data
! Presorted Key: id
! -> Index Scan using test_dc_pkey on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index 6163ed8..9553648
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE: drop cascades to table matest1
*** 1493,1498 ****
--- 1493,1499 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1633,1641 ****
--- 1634,1678 ----
{3,7,8,10,13,13,16,18,19,22}
(3 rows)
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ QUERY PLAN
+ -------------------------------------------------------------------------
+ Merge Append
+ Sort Key: tenk1.thousand, tenk1.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1
+ -> Incremental Sort
+ Sort Key: tenk1_1.thousand, tenk1_1.thousand
+ Presorted Key: tenk1_1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ QUERY PLAN
+ -------------------------------------------------------------
+ Merge Append
+ Sort Key: a.thousand, a.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 a
+ -> Incremental Sort
+ Sort Key: b.unique2, b.unique2
+ Presorted Key: b.unique2
+ -> Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index 568b783..e60fb43
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select count(*) >= 0 as ok from pg_prepa
*** 70,90 ****
-- This is to record the prevailing planner enable_foo settings during
-- a regression test run.
select name, setting from pg_settings where name like 'enable%';
! name | setting
! ----------------------+---------
! enable_bitmapscan | on
! enable_gathermerge | on
! enable_hashagg | on
! enable_hashjoin | on
! enable_indexonlyscan | on
! enable_indexscan | on
! enable_material | on
! enable_mergejoin | on
! enable_nestloop | on
! enable_seqscan | on
! enable_sort | on
! enable_tidscan | on
! (12 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
--- 70,91 ----
-- This is to record the prevailing planner enable_foo settings during
-- a regression test run.
select name, setting from pg_settings where name like 'enable%';
! name | setting
! ------------------------+---------
! enable_bitmapscan | on
! enable_gathermerge | on
! enable_hashagg | on
! enable_hashjoin | on
! enable_incrementalsort | on
! enable_indexonlyscan | on
! enable_indexscan | on
! enable_material | on
! enable_mergejoin | on
! enable_nestloop | on
! enable_seqscan | on
! enable_sort | on
! enable_tidscan | on
! (13 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index d43b75c..ec611f5
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 527,532 ****
--- 527,533 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
*************** SELECT
*** 588,596 ****
--- 589,614 ----
ORDER BY f.i LIMIT 10)
FROM generate_series(1, 3) g(i);
+ set enable_incrementalsort = on;
+
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
On April 3, 2017 12:03:56 PM PDT, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Mon, Apr 3, 2017 at 9:34 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-03-29 00:17:02 +0300, Alexander Korotkov wrote:
On Tue, Mar 28, 2017 at 5:27 PM, David Steele <david@pgmasters.net> wrote:
On 3/20/17 10:19 AM, Heikki Linnakangas wrote:
On 03/20/2017 11:33 AM, Alexander Korotkov wrote:
Please, find rebased patch in the attachment.
I had a quick look at this.
<...>
According to 'perf', 85% of the CPU time is spent in ExecCopySlot(). To
alleviate that, it might be worthwhile to add a special case for when the
group contains exactly one tuple, and not put the tuple to the tuplesort
in that case. Or, if we cannot ensure that the Incremental Sort is
actually faster, the cost model should probably be smarter, to avoid
picking an incremental sort when it's not a win.
This thread has been idle for over a week. Please respond with a new patch
by 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked
"Returned with Feedback".
Thank you for the reminder!
I've just done so. Please resubmit once updated, it's a cool feature.
Thank you!
I already sent a version of the patch after David's reminder.
Please find rebased patch in the attachment.
Cool. I think that's still a bit late for v10?
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Apr 3, 2017 at 10:05 PM, Andres Freund <andres@anarazel.de> wrote:
On April 3, 2017 12:03:56 PM PDT, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
<...>
I already sent version of patch after David's reminder.
Please find rebased patch in the attachment.
Cool. I think that's still a bit late for v10?
I don't know. ISTM that I addressed all the issues raised by reviewers.
Also, this patch has been pending since late 2013. It would be very nice to
finally get it in...
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,
On 2017-04-04 00:04:09 +0300, Alexander Korotkov wrote:
Thank you!
I already sent version of patch after David's reminder.
Please find rebased patch in the attachment.
Cool. I think that's still a bit late for v10?
I don't know. ISTM, that I addressed all the issues raised by reviewers.
Also, this patch is pending since late 2013. It would be very nice to
finally get it in...
To me this hasn't gotten even remotely enough performance evaluation.
And I don't think it's fair to characterize it as pending since 2013,
given it was essentially "waiting on author" for most of that.
Greetings,
Andres Freund
On Tue, Apr 4, 2017 at 12:09 AM, Andres Freund <andres@anarazel.de> wrote:
On 2017-04-04 00:04:09 +0300, Alexander Korotkov wrote:
Thank you!
I already sent version of patch after David's reminder.
Please find rebased patch in the attachment.
Cool. I think that's still a bit late for v10?
I don't know. ISTM, that I addressed all the issues raised by reviewers.
Also, this patch is pending since late 2013. It would be very nice to
finally get it in...
To me this hasn't gotten even remotely enough performance evaluation.
I'm ready to put my effort into that.
And I don't think it's fair to characterize it as pending since 2013,
Probably this duration isn't a good characteristic at all.
given it was essentially "waiting on author" for most of that.
What makes you think so? Do you have some statistics? Or is it just a
random assumption?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Apr 3, 2017 at 5:09 PM, Andres Freund <andres@anarazel.de> wrote:
To me this hasn't gotten even remotely enough performance evaluation.
And I don't think it's fair to characterize it as pending since 2013,
given it was essentially "waiting on author" for most of that.
This is undeniably a patch which has been kicking around for a lot of
time without getting a lot of attention, and if it just keeps getting
punted down the road, it's never going to become committable.
Alexander's questions upthread about what decisions the committer who
took an interest (Heikki) would prefer never really got an answer, for
example. I don't deny that there may be some work left to do here,
but I think blaming the author for a week's delay when this has been
ignored so often for so long is unfair.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-04-03 22:18:21 -0400, Robert Haas wrote:
On Mon, Apr 3, 2017 at 5:09 PM, Andres Freund <andres@anarazel.de> wrote:
To me this hasn't gotten even remotely enough performance evaluation.
And I don't think it's fair to characterize it as pending since 2013,
given it was essentially "waiting on author" for most of that.
This is undeniably a patch which has been kicking around for a lot of
time without getting a lot of attention, and if it just keeps getting
punted down the road, it's never going to become committable.
Indeed, it's old. And it hasn't gotten enough timely feedback.
But I don't think the wait time can meaningfully be measured by
subtracting two dates:
The first version of the patch, as a PoC, was posted 2013-12-14,
which then got a good amount of feedback & revisions, and then stalled
till 2014-07-12. There a few back-and-forths yielded a new version.
From 2014-09-15 till 2015-10-16 the patch stalled, waiting on its
author. That version had open todos [1], as had the version from
2016-03-13 [2], which weren't addressed until 2016-03-30 - unfortunately
that was pretty much when the tree was frozen. 2016-09-13 a rebased patch
was sent, some minor points were raised 2016-10-02 (unaddressed), a
larger review was done 2016-12-01 [5], unaddressed till 2017-02-18.
At that point we're in this thread.
There are obviously some long waiting-on-author periods in there, and
some long needs-review periods.
Alexander's questions upthread about what decisions the committer who
took an interest (Heikki) would prefer never really got an answer, for
example. I don't deny that there may be some work left to do here,
but I think blaming the author for a week's delay when this has been
ignored so often for so long is unfair.
I'm not trying to blame Alexander for a week's worth of delay, at all.
It's just that, well, we're past the original code-freeze date, three
days before the "final" code freeze. I don't think fairness is something
we can achieve at this point :(. Given the risk of regressions -
demonstrated in this thread although partially addressed - and the very
limited amount of benchmarking done, it seems unlikely that this is
going to be merged.
Regards,
Andres
[1]: http://archives.postgresql.org/message-id/CAPpHfdvhwMsG69exCRUGK3ms-ng0PSPcucH5FU6tAaM-qL-1%2Bw%40mail.gmail.com
[2]: http://archives.postgresql.org/message-id/CAPpHfdvzjYGLTyA-8ib8UYnKLPrewd9Z%3DT4YJNCRWiHWHHweWw%40mail.gmail.com
[3]: http://archives.postgresql.org/message-id/CAPpHfdtCcHZ-mLWzsFrRCvHpV1LPSaOGooMZ3sa40AkwR=7ouQ@mail.gmail.com
[4]: http://archives.postgresql.org/message-id/CAPpHfdvj1Tdi2WA64ZbBp5-yG-uzaRXzk3K7J7zt-cRX6YSd0A@mail.gmail.com
[5]: http://archives.postgresql.org/message-id/CA+TgmoZapyHRm7NVyuyZ+yAV=U1a070BOgRe7PkgyrAegR4JDA@mail.gmail.com
[6]: http://archives.postgresql.org/message-id/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
On Wed, Mar 29, 2017 at 5:14 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
I added to cost_sort() extra costing for incremental sort: the cost of
extra tuple copying and comparing, as well as the cost of tuplesort reset.
The only problem is that I made the following estimate for tuplesort reset:
run_cost += 10.0 * cpu_tuple_cost * num_groups;
It makes ordinary sort be selected in your example, but it contains the
constant 10, which is quite arbitrary. It would be nice to avoid such
hard-coded constants, but I don't know how we could calculate such a cost
realistically.
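For reference, the quoted estimate amounts to the following term being added to cost_sort()'s run cost. This is a language-neutral sketch in Python, not the patch's C code; the function name is made up here, and 0.01 is PostgreSQL's default for cpu_tuple_cost:

```python
DEFAULT_CPU_TUPLE_COST = 0.01  # PostgreSQL's default cpu_tuple_cost setting

def tuplesort_reset_cost(num_groups, cpu_tuple_cost=DEFAULT_CPU_TUPLE_COST):
    """Extra run cost charged for resetting the tuplesort once per group of
    equal skip keys.  The factor 10.0 is the arbitrary constant under
    discussion: it prices one reset at ten tuple-processing operations."""
    return 10.0 * cpu_tuple_cost * num_groups
```

So a million-row input split into a million single-row groups is charged an extra 100000.0 cost units, which is what pushes the planner back toward a plain sort.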
That appears to be wrong. I intended to make cost_sort prefer plain sort
over incremental sort for this dataset size, but that appears to be not
always the right solution. Quicksort is that fast only on presorted data.
On my laptop I have following numbers for test case provided by Heikki.
Presorted data – very fast.
# explain select count(*) from (select * from sorttest order by a, c) as t;
QUERY PLAN
-------------------------------------------------------------------------------
Aggregate (cost=147154.34..147154.35 rows=1 width=8)
-> Sort (cost=132154.34..134654.34 rows=1000000 width=12)
Sort Key: sorttest.a, sorttest.c
-> Seq Scan on sorttest (cost=0.00..15406.00 rows=1000000
width=12)
(4 rows)
# select count(*) from (select * from sorttest order by a, c) as t;
count
---------
1000000
(1 row)
Time: 260,752 ms
Non-presorted data – not so fast. It's actually slower than incremental
sort was.
# explain select count(*) from (select * from sorttest order by a desc, c
desc) as t;
QUERY PLAN
-------------------------------------------------------------------------------
Aggregate (cost=130063.84..130063.85 rows=1 width=8)
-> Sort (cost=115063.84..117563.84 rows=1000000 width=12)
Sort Key: sorttest.a DESC, sorttest.c DESC
-> Seq Scan on sorttest (cost=0.00..15406.00 rows=1000000
width=12)
(4 rows)
# select count(*) from (select * from sorttest order by a desc, c desc) as
t;
count
---------
1000000
(1 row)
Time: 416,207 ms
Thus, it would be nice to reflect the fact that our quicksort
implementation is very fast on presorted data. Fortunately, we have
corresponding statistics: STATISTIC_KIND_CORRELATION. However, that should
probably be the subject of a separate patch.
But I'd like to make incremental sort not slower than quicksort on
presorted data. A new idea comes to mind: since the cause of incremental
sort's slowness in this case is too-frequent resetting of the tuplesort,
what if we artificially put the data into larger groups? The attached
revision of the patch implements this: it doesn't stop accumulating tuples
into the tuplesort until it has at least MIN_GROUP_SIZE tuples.
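The batching scheme can be sketched language-neutrally. This is an illustrative Python generator, not the patch's C executor code (which works on TupleTableSlots); the value of MIN_GROUP_SIZE here is an assumption:

```python
MIN_GROUP_SIZE = 64  # minimum batch size; the constant actually used by the patch may differ

def incremental_sort(rows, skip_cols, key):
    """Sort a stream of rows that is already ordered on its first skip_cols
    columns.  Instead of sorting each presorted group separately, keep
    accumulating rows until at least MIN_GROUP_SIZE have been collected and
    a group boundary is reached, then sort and emit the whole batch.  This
    amortizes the per-group sort-reset overhead discussed above."""
    batch = []
    prev_prefix = None
    for row in rows:
        prefix = row[:skip_cols]
        # Flush only at a boundary between presorted groups, and only once
        # the batch is large enough to be worth a separate sort.  Because
        # batches split only at group boundaries and prefixes never decrease,
        # concatenating the sorted batches yields globally sorted output.
        if batch and prefix != prev_prefix and len(batch) >= MIN_GROUP_SIZE:
            batch.sort(key=key)
            yield from batch
            batch = []
        batch.append(row)
        prev_prefix = prefix
    batch.sort(key=key)
    yield from batch
```

Since each batch is emitted as soon as it is full, a LIMIT consumer pays only for sorting the first batch or two rather than the whole input.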
# explain select count(*) from (select * from sorttest order by a, c) as t;
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Aggregate (cost=85412.43..85412.43 rows=1 width=8)
-> Incremental Sort (cost=0.46..72912.43 rows=1000000 width=12)
Sort Key: sorttest.a, sorttest.c
Presorted Key: sorttest.a
-> Index Only Scan using i_sorttest on sorttest
(cost=0.42..30412.42 rows=1000000 width=12)
(5 rows)
# select count(*) from (select * from sorttest order by a, c) as t;
count
---------
1000000
(1 row)
Time: 251,227 ms
# explain select count(*) from (select * from sorttest order by a desc, c
desc) as t;
QUERY PLAN
────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Aggregate (cost=85412.43..85412.43 rows=1 width=8)
-> Incremental Sort (cost=0.46..72912.43 rows=1000000 width=12)
Sort Key: sorttest.a DESC, sorttest.c DESC
Presorted Key: sorttest.a
-> Index Only Scan Backward using i_sorttest on sorttest
(cost=0.42..30412.42 rows=1000000 width=12)
(5 rows)
# select count(*) from (select * from sorttest order by a desc, c desc) as
t;
count
---------
1000000
(1 row)
Time: 253,270 ms
Now incremental sort is not slower than quicksort, which seems cool.
However, in the LIMIT case we will pay the price of fetching some extra
tuples from the outer node. But that doesn't seem to hurt us too much.
# explain select * from sorttest order by a, c limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Limit (cost=0.46..0.84 rows=10 width=12)
-> Incremental Sort (cost=0.46..37500.78 rows=1000000 width=12)
Sort Key: a, c
Presorted Key: a
-> Index Only Scan using i_sorttest on sorttest
(cost=0.42..30412.42 rows=1000000 width=12)
(5 rows)
# select * from sorttest order by a, c limit 10;
a | b | c
----+----+----
1 | 1 | 1
2 | 2 | 2
3 | 3 | 3
4 | 4 | 4
5 | 5 | 5
6 | 6 | 6
7 | 7 | 7
8 | 8 | 8
9 | 9 | 9
10 | 10 | 10
(10 rows)
Time: 0,903 ms
Any thoughts?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
incremental-sort-6.patchapplication/octet-stream; name=incremental-sort-6.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index d1bc5b0..c9de7ea
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1943,1981 ****
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ---------------------------------------------------------------------
Limit
! Output: t1.c1, t2.c1
-> Sort
! Output: t1.c1, t2.c1
! Sort Key: t1.c1, t2.c1
-> Nested Loop
! Output: t1.c1, t2.c1
-> Foreign Scan on public.ft1 t1
! Output: t1.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-> Materialize
! Output: t2.c1
-> Foreign Scan on public.ft2 t2
! Output: t2.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! c1 | c1
! ----+-----
! 1 | 101
! 1 | 102
! 1 | 103
! 1 | 104
! 1 | 105
! 1 | 106
! 1 | 107
! 1 | 108
! 1 | 109
! 1 | 110
(10 rows)
-- different server, not pushed down. No result expected.
--- 1943,1981 ----
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ------------------------------------------------------------------
Limit
! Output: t1.c3, t2.c3
-> Sort
! Output: t1.c3, t2.c3
! Sort Key: t1.c3, t2.c3
-> Nested Loop
! Output: t1.c3, t2.c3
-> Foreign Scan on public.ft1 t1
! Output: t1.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
-> Materialize
! Output: t2.c3
-> Foreign Scan on public.ft2 t2
! Output: t2.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! c3 | c3
! -------+-------
! 00001 | 00101
! 00001 | 00102
! 00001 | 00103
! 00001 | 00104
! 00001 | 00105
! 00001 | 00106
! 00001 | 00107
! 00001 | 00108
! 00001 | 00109
! 00001 | 00110
(10 rows)
-- different server, not pushed down. No result expected.
*************** select c2/2, sum(c2) * (c2/2) from ft1 g
*** 2517,2534 ****
-- Aggregates in subquery are pushed down.
explain (verbose, costs off)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
! QUERY PLAN
! ----------------------------------------------------------------------------------------------------------
Aggregate
Output: count(ft1.c2), sum(ft1.c2)
! -> Sort
Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
Sort Key: ft1.c2, (sum(ft1.c1))
! -> Foreign Scan
! Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
! Relations: Aggregate on (public.ft1)
! Remote SQL: SELECT c2, sum("C 1"), sqrt("C 1") FROM "S 1"."T 1" GROUP BY c2, (sqrt("C 1"))
! (9 rows)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
count | sum
--- 2517,2537 ----
-- Aggregates in subquery are pushed down.
explain (verbose, costs off)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
! QUERY PLAN
! --------------------------------------------------------------------------------------------------------------------------
Aggregate
Output: count(ft1.c2), sum(ft1.c2)
! -> Incremental Sort
Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
Sort Key: ft1.c2, (sum(ft1.c1))
! Presorted Key: ft1.c2
! -> GroupAggregate
! Output: ft1.c2, sum(ft1.c1), (sqrt((ft1.c1)::double precision))
! Group Key: ft1.c2, sqrt((ft1.c1)::double precision)
! -> Foreign Scan on public.ft1
! Output: ft1.c2, sqrt((ft1.c1)::double precision), ft1.c1
! Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" ORDER BY c2 ASC NULLS LAST, sqrt("C 1") ASC NULLS LAST
! (12 rows)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
count | sum
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index 509bb54..263a646
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 487,494 ****
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 487,494 ----
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index e02b0c8..ad6b7d3
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3563,3568 ****
--- 3563,3582 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+ <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of incremental sort
+ steps. The default is <literal>on</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
<term><varname>enable_indexscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 9359d0a..52987bb
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual,
*** 80,85 ****
--- 80,87 ----
ExplainState *es);
static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 93,99 ****
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 95,101 ----
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 101,106 ****
--- 103,110 ----
static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 1003,1008 ****
--- 1007,1015 ----
case T_Sort:
pname = sname = "Sort";
break;
+ case T_IncrementalSort:
+ pname = sname = "Incremental Sort";
+ break;
case T_Group:
pname = sname = "Group";
break;
*************** ExplainNode(PlanState *planstate, List *
*** 1593,1598 ****
--- 1600,1611 ----
show_sort_keys(castNode(SortState, planstate), ancestors, es);
show_sort_info(castNode(SortState, planstate), es);
break;
+ case T_IncrementalSort:
+ show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ ancestors, es);
+ show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ es);
+ break;
case T_MergeAppend:
show_merge_append_keys(castNode(MergeAppendState, planstate),
ancestors, es);
*************** static void
*** 1918,1932 ****
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
* Likewise, for a MergeAppend node.
*/
static void
--- 1931,1968 ----
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
+ int skipCols;
+
+ if (IsA(plan, IncrementalSort))
+ skipCols = ((IncrementalSort *) plan)->skipCols;
+ else
+ skipCols = 0;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, skipCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es)
+ {
+ IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+ show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ plan->sort.numCols, plan->skipCols,
+ plan->sort.sortColIdx,
+ plan->sort.sortOperators, plan->sort.collations,
+ plan->sort.nullsFirst,
+ ancestors, es);
+ }
+
+ /*
* Likewise, for a MergeAppend node.
*/
static void
*************** show_merge_append_keys(MergeAppendState
*** 1936,1942 ****
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
--- 1972,1978 ----
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, 0, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1960,1966 ****
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
--- 1996,2002 ----
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
*************** show_grouping_set_keys(PlanState *planst
*** 2029,2035 ****
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
--- 2065,2071 ----
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, 0, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 2086,2092 ****
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
--- 2122,2128 ----
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2099,2111 ****
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
--- 2135,2148 ----
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
+ List *resultPresorted = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2145,2153 ****
--- 2182,2194 ----
nullsFirst[keyno]);
/* Emit one property-list item per sort key */
result = lappend(result, pstrdup(sortkeybuf.data));
+ if (keyno < nPresortedKeys)
+ resultPresorted = lappend(resultPresorted, exprstr);
}
ExplainPropertyList(qlabel, result, es);
+ if (nPresortedKeys > 0)
+ ExplainPropertyList("Presorted Key", resultPresorted, es);
}
/*
*************** show_sort_info(SortState *sortstate, Exp
*** 2305,2310 ****
--- 2346,2388 ----
}
/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es)
+ {
+ if (es->analyze && incrsortstate->sort_Done &&
+ incrsortstate->tuplesortstate != NULL)
+ {
+ Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+
+ tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",
+ sortMethod, spaceType, spaceUsed);
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort groups: %ld\n",
+ incrsortstate->groupsCount);
+ }
+ else
+ {
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups",
+ incrsortstate->groupsCount, es);
+ }
+ }
+ }
+
+ /*
* Show information on hash buckets/batches.
*/
static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index 083b20f..b093618
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execExpr.
*** 24,31 ****
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
! nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
--- 24,31 ----
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
! nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index 7e85c66..e7fd9f9
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 30,35 ****
--- 30,36 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 248,253 ****
--- 249,258 ----
ExecReScanSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecReScanIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecReScanGroup((GroupState *) node);
break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 519,526 ****
--- 524,535 ----
case T_CteScan:
case T_Material:
case T_Sort:
+ /* these don't evaluate tlist */
return true;
+ case T_IncrementalSort:
+ return false;
+
case T_LockRows:
case T_Limit:
return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index 486ddf1..2f4a23a
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 93,98 ****
--- 93,99 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 308,313 ****
--- 309,319 ----
estate, eflags);
break;
+ case T_IncrementalSort:
+ result = (PlanState *) ExecInitIncrementalSort(
+ (IncrementalSort *) node, estate, eflags);
+ break;
+
case T_Group:
result = (PlanState *) ExecInitGroup((Group *) node,
estate, eflags);
*************** ExecProcNode(PlanState *node)
*** 531,536 ****
--- 537,546 ----
result = ExecSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ result = ExecIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
result = ExecGroup((GroupState *) node);
break;
*************** ExecEndNode(PlanState *node)
*** 803,808 ****
--- 813,822 ----
ExecEndSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecEndIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecEndGroup((GroupState *) node);
break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index c2b8618..551664c
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 655,660 ****
--- 655,661 ----
sortnode->collations,
sortnode->nullsFirst,
work_mem,
+ false,
false);
}
*************** initialize_aggregate(AggState *aggstate,
*** 736,742 ****
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false);
}
/*
--- 737,743 ----
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false, false);
}
/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...052943b
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,546 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ * Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ * Incremental sort is a specially optimized kind of multikey sort used
+ * when the input is already presorted by a prefix of the required keys
+ * list. Thus, when it's required to sort by (key1, key2 ... keyN) and
+ * the input is already sorted by (key1, key2 ... keyM), M < N, we sort
+ * groups in which the values of (key1, key2 ... keyM) are equal.
+ *
+ * Consider the following example. We have input tuples consisting of
+ * two integers (x, y), already presorted by x, while it's required to
+ * sort them by both x and y. Let the input tuples be the following.
+ *
+ * (1, 5)
+ * (1, 2)
+ * (2, 10)
+ * (2, 1)
+ * (2, 5)
+ * (3, 3)
+ * (3, 7)
+ *
+ * The incremental sort algorithm would sort the following groups, which
+ * have equal x, individually:
+ * (1, 5) (1, 2)
+ * (2, 10) (2, 1) (2, 5)
+ * (3, 3) (3, 7)
+ *
+ * After sorting these groups and putting them together, we get the
+ * following tuple set, which is sorted by both x and y.
+ *
+ * (1, 2)
+ * (1, 5)
+ * (2, 1)
+ * (2, 5)
+ * (2, 10)
+ * (3, 3)
+ * (3, 7)
+ *
+ * Incremental sort is faster than a full sort on large datasets. But
+ * it gives the biggest benefit for queries with LIMIT, because
+ * incremental sort can return the first tuples without reading the
+ * whole input dataset.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+
+ /*
+ * Check if first "skipCols" sort values are equal.
+ */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ TupleTableSlot *b)
+ {
+ int n, i;
+
+ Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+ n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+ for (i = 0; i < n; i++)
+ {
+ Datum datumA, datumB, result;
+ bool isnullA, isnullB;
+ AttrNumber attno = node->skipKeys[i].attno;
+ SkipKeyData *key;
+
+ datumA = slot_getattr(a, attno, &isnullA);
+ datumB = slot_getattr(b, attno, &isnullB);
+
+ /* Special case for NULL-vs-NULL, else use standard comparison */
+ if (isnullA || isnullB)
+ {
+ if (isnullA == isnullB)
+ continue;
+ else
+ return false;
+ }
+
+ key = &node->skipKeys[i];
+
+ key->fcinfo.arg[0] = datumA;
+ key->fcinfo.arg[1] = datumB;
+
+ /* just for paranoia's sake, we reset isnull each time */
+ key->fcinfo.isnull = false;
+
+ result = FunctionCallInvoke(&key->fcinfo);
+
+ /* Check for null result, since caller is clearly not expecting one */
+ if (key->fcinfo.isnull)
+ elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+ if (!DatumGetBool(result))
+ return false;
+ }
+ return true;
+ }
+
+ /*
+ * Prepare information for skipKeys comparison.
+ */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ int skipCols,
+ i;
+
+ Assert(IsA(plannode, IncrementalSort));
+ skipCols = plannode->skipCols;
+
+ node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+ for (i = 0; i < skipCols; i++)
+ {
+ Oid equalityOp, equalityFunc;
+ SkipKeyData *key;
+
+ key = &node->skipKeys[i];
+ key->attno = plannode->sort.sortColIdx[i];
+
+ equalityOp = get_equality_op_for_ordering_op(
+ plannode->sort.sortOperators[i], NULL);
+ if (!OidIsValid(equalityOp))
+ elog(ERROR, "missing equality operator for ordering operator %u",
+ plannode->sort.sortOperators[i]);
+
+ equalityFunc = get_opcode(equalityOp);
+ if (!OidIsValid(equalityFunc))
+ elog(ERROR, "missing function for operator %u", equalityOp);
+
+ /* Lookup the comparison function */
+ fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+ /* We can initialize the callinfo just once and re-use it */
+ InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ plannode->sort.collations[i], NULL, NULL);
+ key->fcinfo.argnull[0] = false;
+ key->fcinfo.argnull[1] = false;
+ }
+ }
+
+
+ #define MIN_GROUP_SIZE 32
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSort
+ *
+ * Assuming that the outer subtree returns tuples presorted by some
+ * prefix of the target sort columns, perform an incremental sort. We
+ * fetch groups of tuples whose prefix sort columns are equal and sort
+ * them using tuplesort. This approach avoids sorting the whole
+ * dataset. Besides taking less memory and being faster, it lets us
+ * start returning tuples before fetching the full dataset from the
+ * outer subtree.
+ *
+ * Conditions:
+ * -- none.
+ *
+ * Initial States:
+ * -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+ TupleTableSlot *
+ ExecIncrementalSort(IncrementalSortState *node)
+ {
+ EState *estate;
+ ScanDirection dir;
+ Tuplesortstate *tuplesortstate;
+ TupleTableSlot *slot;
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ PlanState *outerNode;
+ int skipCols;
+ TupleDesc tupDesc;
+ int64 nTuples = 0;
+
+ skipCols = plannode->skipCols;
+
+ /*
+ * get state info from node
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "entering routine");
+
+ estate = node->ss.ps.state;
+ dir = estate->es_direction;
+ tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+ /*
+ * Return next tuple from sorted set if any.
+ */
+ if (node->sort_Done)
+ {
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ if (tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL) || node->finished)
+ return slot;
+ }
+
+ /*
+ * If first time through, read all tuples from outer plan and pass them to
+ * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
+ */
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "sorting subplan");
+
+ /*
+ * Want to scan subplan in the forward direction while creating the
+ * sorted data.
+ */
+ estate->es_direction = ForwardScanDirection;
+
+ /*
+ * Initialize tuplesort module.
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "calling tuplesort_begin");
+
+ outerNode = outerPlanState(node);
+ tupDesc = ExecGetResultType(outerNode);
+
+ if (node->tuplesortstate == NULL)
+ {
+ /*
+ * We are going to process the first group of presorted data.
+ * Initialize support structures for cmpSortSkipCols(), which
+ * compares the already-sorted columns.
+ */
+ prepareSkipCols(node);
+
+ /*
+ * Pass all the columns to tuplesort. We pass groups of at least
+ * MIN_GROUP_SIZE tuples to the tuplesort, so these groups don't
+ * necessarily have equal values of the first column. We are unlikely
+ * to have huge groups with incremental sort, so using abbreviated
+ * keys would likely be a waste of time.
+ */
+ tuplesortstate = tuplesort_begin_heap(
+ tupDesc,
+ plannode->sort.numCols,
+ plannode->sort.sortColIdx,
+ plannode->sort.sortOperators,
+ plannode->sort.collations,
+ plannode->sort.nullsFirst,
+ work_mem,
+ false,
+ true);
+ node->tuplesortstate = (void *) tuplesortstate;
+ node->groupsCount++;
+ }
+ else
+ {
+ /* Next group of presorted data */
+ tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ node->groupsCount++;
+ }
+
+ /* Calculate remaining bound for bounded sort */
+ if (node->bounded)
+ tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+ /*
+ * Put the next group of tuples, where the skipCols sort values are
+ * equal, into the tuplesort.
+ */
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ /* Put next group of presorted data to the tuplesort */
+ if (nTuples < MIN_GROUP_SIZE)
+ {
+ if (!TupIsNull(node->prevSlot))
+ {
+ tuplesort_puttupleslot(tuplesortstate, node->prevSlot);
+ ExecClearTuple(node->prevSlot);
+ nTuples++;
+ }
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ nTuples++;
+ }
+ else if (TupIsNull(node->prevSlot))
+ {
+ /* First tuple */
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+ else
+ {
+ ExecCopySlot(node->prevSlot, slot);
+ }
+ }
+ else
+ {
+ /* Put previous tuple into tuplesort */
+ tuplesort_puttupleslot(tuplesortstate, node->prevSlot);
+ nTuples++;
+
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+ else
+ {
+ bool cmp;
+ cmp = cmpSortSkipCols(node, node->prevSlot, slot);
+
+ /* Replace previous tuple with current one */
+ ExecCopySlot(node->prevSlot, slot);
+
+ /*
+ * When skipCols are not equal then group of presorted data
+ * is finished
+ */
+ if (!cmp)
+ break;
+ }
+ }
+ }
+
+ /*
+ * Complete the sort.
+ */
+ tuplesort_performsort(tuplesortstate);
+
+ /*
+ * restore to user specified direction
+ */
+ estate->es_direction = dir;
+
+ /*
+ * finally set the sorted flag to true
+ */
+ node->sort_Done = true;
+ node->bounded_Done = node->bounded;
+
+ /*
+ * Adjust bound_Done with number of tuples we've actually sorted.
+ */
+ if (node->bounded)
+ {
+ if (node->finished)
+ node->bound_Done = node->bound;
+ else
+ node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ }
+
+ SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "retrieving tuple from tuplesort");
+
+ /*
+ * Get the first or next tuple from tuplesort. Returns NULL if no more
+ * tuples.
+ */
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+ return slot;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecInitIncrementalSort
+ *
+ * Creates the run-time state information for the sort node
+ * produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ IncrementalSortState *incrsortstate;
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "initializing sort node");
+
+ /*
+ * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only hold the
+ * current group in tuplesortstate.
+ */
+ Assert((eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) == 0);
+
+ /*
+ * create state structure
+ */
+ incrsortstate = makeNode(IncrementalSortState);
+ incrsortstate->ss.ps.plan = (Plan *) node;
+ incrsortstate->ss.ps.state = estate;
+
+ incrsortstate->bounded = false;
+ incrsortstate->sort_Done = false;
+ incrsortstate->finished = false;
+ incrsortstate->tuplesortstate = NULL;
+ incrsortstate->prevSlot = NULL;
+ incrsortstate->bound_Done = 0;
+ incrsortstate->groupsCount = 0;
+ incrsortstate->skipKeys = NULL;
+
+ /*
+ * Miscellaneous initialization
+ *
+ * Sort nodes don't initialize their ExprContexts because they never call
+ * ExecQual or ExecProject.
+ */
+
+ /*
+ * tuple table initialization
+ *
+ * sort nodes only return scan tuples from their sorted relation.
+ */
+ ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+
+ /*
+ * initialize child nodes
+ *
+ * We shield the child node from the need to support REWIND, BACKWARD, or
+ * MARK/RESTORE.
+ */
+ eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+ outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+ /*
+ * initialize tuple type. no need to initialize projection info because
+ * this node doesn't do projections.
+ */
+ ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+ /* make standalone slot to store previous tuple from outer node */
+ incrsortstate->prevSlot = MakeSingleTupleTableSlot(
+ ExecGetResultType(outerPlanState(incrsortstate)));
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "sort node initialized");
+
+ return incrsortstate;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "shutting down sort node");
+
+ /*
+ * clean out the tuple table
+ */
+ ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ /* must drop standalone tuple slot from outer node */
+ ExecDropSingleTupleTableSlot(node->prevSlot);
+
+ /*
+ * Release tuplesort resources
+ */
+ if (node->tuplesortstate != NULL)
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+
+ /*
+ * shut down the subplan
+ */
+ ExecEndNode(outerPlanState(node));
+
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "sort node shutdown");
+ }
+
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ PlanState *outerPlan = outerPlanState(node);
+
+ /*
+ * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ * re-scan it at all.
+ */
+ if (!node->sort_Done)
+ return;
+
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+ /*
+ * We must forget previous sort results and re-read the subplan:
+ * since only the current group is kept in the tuplesortstate, the
+ * sorted output can't simply be rewound and rescanned.
+ */
+ node->sort_Done = false;
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+ node->bound_Done = 0;
+
+ /*
+ * if chgParam of subnode is not null then plan will be re-scanned by
+ * first ExecProcNode.
+ */
+ if (outerPlan->chgParam == NULL)
+ ExecReScan(outerPlan);
+ }
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 924b458..1809e5d
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(SortState *node)
*** 89,95 ****
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
--- 89,96 ----
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess,
! false);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index 00a0fed..f57b6db
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 913,918 ****
--- 913,936 ----
/*
+ * CopySortFields
+ *
+ * This function copies the fields of the Sort node. It is used by
+ * all the copy functions for classes which inherit from Sort.
+ */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+ COPY_SCALAR_FIELD(numCols);
+ COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+
+ /*
* _copySort
*/
static Sort *
*************** _copySort(const Sort *from)
*** 923,935 ****
/*
* copy node superclass fields
*/
! CopyPlanFields((const Plan *) from, (Plan *) newnode);
! COPY_SCALAR_FIELD(numCols);
! COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
return newnode;
}
--- 941,969 ----
/*
* copy node superclass fields
*/
! CopySortFields(from, newnode);
! return newnode;
! }
!
!
! /*
! * _copyIncrementalSort
! */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! IncrementalSort *newnode = makeNode(IncrementalSort);
!
! /*
! * copy node superclass fields
! */
! CopySortFields((const Sort *) from, (Sort *) newnode);
!
! /*
! * copy remainder of node
! */
! COPY_SCALAR_FIELD(skipCols);
return newnode;
}
*************** copyObjectImpl(const void *from)
*** 4781,4786 ****
--- 4815,4823 ----
case T_Sort:
retval = _copySort(from);
break;
+ case T_IncrementalSort:
+ retval = _copyIncrementalSort(from);
+ break;
case T_Group:
retval = _copyGroup(from);
break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index 28cef85..59eea51
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 839,850 ****
}
static void
! _outSort(StringInfo str, const Sort *node)
{
int i;
- WRITE_NODE_TYPE("SORT");
-
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
--- 839,848 ----
}
static void
! _outSortInfo(StringInfo str, const Sort *node)
{
int i;
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 867,872 ****
--- 865,888 ----
}
static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ WRITE_NODE_TYPE("SORT");
+
+ _outSortInfo(str, node);
+ }
+
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ WRITE_NODE_TYPE("INCREMENTALSORT");
+
+ _outSortInfo(str, (const Sort *) node);
+
+ WRITE_INT_FIELD(skipCols);
+ }
+
+ static void
_outUnique(StringInfo str, const Unique *node)
{
int i;
*************** outNode(StringInfo str, const void *obj)
*** 3693,3698 ****
--- 3709,3717 ----
case T_Sort:
_outSort(str, obj);
break;
+ case T_IncrementalSort:
+ _outIncrementalSort(str, obj);
+ break;
case T_Unique:
_outUnique(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index a883220..ccd49ec
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 2036,2047 ****
}
/*
! * _readSort
*/
! static Sort *
! _readSort(void)
{
! READ_LOCALS(Sort);
ReadCommonPlan(&local_node->plan);
--- 2036,2048 ----
}
/*
! * ReadCommonSort
! * Assign the basic stuff of all nodes that inherit from Sort
*/
! static void
! ReadCommonSort(Sort *local_node)
{
! READ_TEMP_LOCALS();
ReadCommonPlan(&local_node->plan);
*************** _readSort(void)
*** 2050,2055 ****
--- 2051,2082 ----
READ_OID_ARRAY(sortOperators, local_node->numCols);
READ_OID_ARRAY(collations, local_node->numCols);
READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+
+ /*
+ * _readSort
+ */
+ static Sort *
+ _readSort(void)
+ {
+ READ_LOCALS_NO_FIELDS(Sort);
+
+ ReadCommonSort(local_node);
+
+ READ_DONE();
+ }
+
+ /*
+ * _readIncrementalSort
+ */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ READ_LOCALS(IncrementalSort);
+
+ ReadCommonSort(&local_node->sort);
+
+ READ_INT_FIELD(skipCols);
READ_DONE();
}
*************** parseNodeString(void)
*** 2602,2607 ****
--- 2629,2636 ----
return_value = _readMaterial();
else if (MATCH("SORT", 4))
return_value = _readSort();
+ else if (MATCH("INCREMENTALSORT", 15))
+ return_value = _readIncrementalSort();
else if (MATCH("GROUP", 5))
return_value = _readGroup();
else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index b93b4fc..74c047a
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3280,3285 ****
--- 3280,3289 ----
ptype = "Sort";
subpath = ((SortPath *) path)->subpath;
break;
+ case T_IncrementalSortPath:
+ ptype = "IncrementalSort";
+ subpath = ((SortPath *) path)->subpath;
+ break;
case T_GroupPath:
ptype = "Group";
subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 52643d0..165d049
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
+ bool enable_incrementalsort = true;
bool enable_hashagg = true;
bool enable_nestloop = true;
bool enable_material = true;
*************** cost_recursive_union(Path *runion, Path
*** 1600,1605 ****
--- 1601,1613 ----
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
+ * The sort can be either a full sort of the relation or an incremental sort
+ * when the input data is already presorted by some of the required pathkeys.
+ * In the latter case, we estimate the number of groups into which the input
+ * is divided by the presorted pathkeys, then estimate the cost of sorting
+ * each individual group, assuming the data is distributed uniformly across
+ * groups. Also, if LIMIT is specified, we only have to fetch from the input
+ * and sort some of the total groups.
+ *
* If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path
*** 1626,1632 ****
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
--- 1634,1641 ----
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_startup_cost' is the startup cost for reading the input data
! * 'input_total_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path
*** 1642,1660 ****
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_cost;
! Cost run_cost = 0;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
path->rows = tuples;
--- 1651,1678 ----
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_startup_cost;
! Cost run_cost = 0,
! rest_cost,
! group_cost,
! input_run_cost = input_total_cost - input_startup_cost;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
+ double num_groups,
+ group_input_bytes,
+ group_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
+ if (!enable_incrementalsort)
+ presorted_keys = 0;
path->rows = tuples;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1680,1692 ****
output_bytes = input_bytes;
}
! if (output_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(input_bytes / BLCKSZ);
! double nruns = input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
--- 1698,1747 ----
output_bytes = input_bytes;
}
! /*
! * Estimate the number of groups into which the presorted keys divide the
! * input data.
! */
! if (presorted_keys > 0)
! {
! List *presortedExprs = NIL;
! ListCell *l;
! int i = 0;
!
! /* Extract presorted keys as list of expressions */
! foreach(l, pathkeys)
! {
! PathKey *key = (PathKey *)lfirst(l);
! EquivalenceMember *member = (EquivalenceMember *)
! lfirst(list_head(key->pk_eclass->ec_members));
!
! presortedExprs = lappend(presortedExprs, member->em_expr);
!
! i++;
! if (i >= presorted_keys)
! break;
! }
!
! /* Estimate number of groups with equal presorted keys */
! num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! }
! else
! {
! num_groups = 1.0;
! }
!
! /*
! * Estimate the average cost of sorting one group of tuples in which all
! * presorted keys are equal.
! */
! group_input_bytes = input_bytes / num_groups;
! group_tuples = tuples / num_groups;
! if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(group_input_bytes / BLCKSZ);
! double nruns = group_input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1696,1702 ****
*
* Assume about N log2 N comparisons
*/
! startup_cost += comparison_cost * tuples * LOG2(tuples);
/* Disk costs */
--- 1751,1757 ----
*
* Assume about N log2 N comparisons
*/
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
/* Disk costs */
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1707,1716 ****
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! startup_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1762,1771 ----
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! group_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1718,1731 ****
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
}
else
{
! /* We'll use plain quicksort on all the input tuples */
! startup_cost += comparison_cost * tuples * LOG2(tuples);
}
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
--- 1773,1805 ----
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
}
else
{
! /*
! * We'll use plain quicksort on all the input tuples. If we expect
! * fewer than two tuples per sort group, assume the logarithmic part
! * of the estimate to be 1.
! */
! if (group_tuples >= 2.0)
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
! else
! group_cost = comparison_cost * group_tuples;
}
+ /* Add the per-group cost of fetching tuples from the input */
+ group_cost += input_run_cost / num_groups;
+
+ /*
+ * We have to sort the first group before the node can start returning
+ * tuples; sorting the remaining groups is required to return all the
+ * other tuples.
+ */
+ startup_cost += group_cost;
+ rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ if (rest_cost > 0.0)
+ run_cost += rest_cost;
+
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1736,1741 ****
--- 1810,1828 ----
*/
run_cost += cpu_operator_cost * tuples;
+ /* Extra costs of incremental sort */
+ if (presorted_keys > 0)
+ {
+ /*
+ * In the incremental sort case we must also account for the cost of
+ * detecting sort group boundaries, which amounts to an extra copy and
+ * comparison for each tuple.
+ */
+ run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+ /* Cost of per group tuplesort reset */
+ run_cost += 2.0 * cpu_tuple_cost * num_groups;
+ }
+
path->startup_cost = startup_cost;
path->total_cost = startup_cost + run_cost;
}
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2489,2494 ****
--- 2576,2583 ----
cost_sort(&sort_path,
root,
outersortkeys,
+ pathkeys_common(outer_path->pathkeys, outersortkeys),
+ outer_path->startup_cost,
outer_path->total_cost,
outer_path_rows,
outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2515,2520 ****
--- 2604,2611 ----
cost_sort(&sort_path,
root,
innersortkeys,
+ pathkeys_common(inner_path->pathkeys, innersortkeys),
+ inner_path->startup_cost,
inner_path->total_cost,
inner_path_rows,
inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 2c26906..2da6f40
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
#include "nodes/nodeFuncs.h"
#include "nodes/plannodes.h"
#include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/tlist.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
return PATHKEYS_EQUAL;
}
+
+ /*
+ * pathkeys_common
+ * Returns length of longest common prefix of keys1 and keys2.
+ */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ int n;
+ ListCell *key1,
+ *key2;
+ n = 0;
+
+ forboth(key1, keys1, key2, keys2)
+ {
+ PathKey *pathkey1 = (PathKey *) lfirst(key1);
+ PathKey *pathkey2 = (PathKey *) lfirst(key2);
+
+ if (pathkey1 != pathkey2)
+ return n;
+ n++;
+ }
+
+ return n;
+ }
+
+
/*
* pathkeys_contained_in
* Common special case of compare_pathkeys: we just want to know
*************** get_cheapest_path_for_pathkeys(List *pat
*** 373,380 ****
/*
* get_cheapest_fractional_path_for_pathkeys
* Find the cheapest path (for retrieving a specified fraction of all
! * the tuples) that satisfies the given pathkeys and parameterization.
! * Return NULL if no such path.
*
* See compare_fractional_path_costs() for the interpretation of the fraction
* parameter.
--- 402,413 ----
/*
* get_cheapest_fractional_path_for_pathkeys
* Find the cheapest path (for retrieving a specified fraction of all
! * the tuples) that satisfies the given parameterization and at least
! * partially satisfies the given pathkeys. Return NULL if no such path is
! * found. If the pathkeys are satisfied only partially, we would have to
! * perform an incremental sort to satisfy them completely. Since incremental
! * sort consumes its input in presorted groups, we would have to consume more
! * input than in the case of a fully presorted path.
*
* See compare_fractional_path_costs() for the interpretation of the fraction
* parameter.
*************** right_merge_direction(PlannerInfo *root,
*** 1488,1513 ****
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
! * no good to order by just the first key(s) of the requested ordering.
! * So the result is always either 0 or list_length(root->query_pathkeys).
*/
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
{
! if (root->query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
{
! /* It's useful ... or at least the first N keys are */
! return list_length(root->query_pathkeys);
}
-
- return 0; /* path ordering not useful */
}
/*
--- 1521,1562 ----
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Returns the number of pathkeys matching the given argument. The
! * remainder can be satisfied by incremental sort.
*/
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
{
! int n_common_pathkeys;
!
! if (query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
!
! if (enable_incrementalsort)
{
! /*
! * Return the number of pathkeys in common, or 0 if there are none.
! * Any common prefix of pathkeys is useful for ordering, because the
! * remaining keys can be handled by an incremental sort.
! */
! return n_common_pathkeys;
! }
! else
! {
! /*
! * When incremental sort is disabled, pathkeys are useful only if they
! * contain all of the query pathkeys.
! */
! if (n_common_pathkeys == list_length(query_pathkeys))
! return n_common_pathkeys;
! else
! return 0;
}
}
/*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1523,1529 ****
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
--- 1572,1578 ----
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 95e6eb7..fbee577
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 237,243 ****
Plan *lefttree, Plan *righttree,
JoinType jointype, bool inner_unique,
bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 237,243 ----
Plan *lefttree, Plan *righttree,
JoinType jointype, bool inner_unique,
bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static Plan *prepare_sort_from_pathkeys(
*** 252,261 ****
static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 252,263 ----
static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! int skipCols);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 437,442 ****
--- 439,445 ----
(GatherPath *) best_path);
break;
case T_Sort:
+ case T_IncrementalSort:
plan = (Plan *) create_sort_plan(root,
(SortPath *) best_path,
flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1112,1117 ****
--- 1115,1121 ----
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int n_common_pathkeys;
/* Build the child plan */
/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1146,1154 ****
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
--- 1150,1160 ----
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
+ n_common_pathkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1497,1502 ****
--- 1503,1509 ----
Plan *subplan;
List *pathkeys = best_path->path.pathkeys;
List *tlist = build_path_tlist(root, &best_path->path);
+ int n_common_pathkeys;
/* As with Gather, it's best to project away columns in the workers. */
subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1523,1534 ****
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
--- 1530,1545 ----
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
! {
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ n_common_pathkeys,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
+ }
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1641,1646 ****
--- 1652,1658 ----
{
Sort *plan;
Plan *subplan;
+ int n_common_pathkeys;
/*
* We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1650,1656 ****
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
--- 1662,1672 ----
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! best_path->subpath->pathkeys);
!
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! n_common_pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1894,1900 ****
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
! subplan);
}
if (!rollup->is_hashed)
--- 1910,1917 ----
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
! subplan,
! 0);
}
if (!rollup->is_hashed)
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3830,3837 ****
*/
if (best_path->outersortkeys)
{
! Sort *sort = make_sort_from_pathkeys(outer_plan,
! best_path->outersortkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
--- 3847,3860 ----
*/
if (best_path->outersortkeys)
{
! Sort *sort;
! int n_common_pathkeys;
!
! n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! best_path->jpath.outerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3842,3849 ****
if (best_path->innersortkeys)
{
! Sort *sort = make_sort_from_pathkeys(inner_plan,
! best_path->innersortkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
--- 3865,3878 ----
if (best_path->innersortkeys)
{
! Sort *sort;
! int n_common_pathkeys;
!
! n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! best_path->jpath.innerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4901,4907 ****
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
--- 4930,4937 ----
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL, 0,
! lefttree->startup_cost,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5490,5502 ****
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node = makeNode(Sort);
! Plan *plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
--- 5520,5550 ----
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node;
! Plan *plan;
+ /* Always use regular sort node when enable_incrementalsort = false */
+ if (!enable_incrementalsort)
+ skipCols = 0;
+
+ if (skipCols == 0)
+ {
+ node = makeNode(Sort);
+ }
+ else
+ {
+ IncrementalSort *incrementalSort;
+
+ incrementalSort = makeNode(IncrementalSort);
+ node = &incrementalSort->sort;
+ incrementalSort->skipCols = skipCols;
+ }
+
+ plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass
*** 5829,5835 ****
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys)
{
int numsortkeys;
AttrNumber *sortColIdx;
--- 5877,5883 ----
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, int skipCols)
{
int numsortkeys;
AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree,
*** 5849,5855 ****
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5897,5903 ----
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5892,5898 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5940,5946 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, 0,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5913,5919 ****
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
--- 5961,5968 ----
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5946,5952 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5995,6001 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** is_projection_capable_plan(Plan *plan)
*** 6596,6601 ****
--- 6645,6651 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index 5565736..eaf7a78
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
#include "parser/parse_clause.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
#include "utils/syscache.h"
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index 649a233..b1f85e6
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3752,3765 ****
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_partial_path || is_sorted)
{
/* Sort the cheapest partial path, if it isn't already */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3752,3765 ----
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_partial_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest partial path, if it isn't already */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3832,3845 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_path || is_sorted)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3832,3845 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4906,4918 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || is_sorted)
{
! if (!is_sorted)
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
--- 4906,4918 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || n_useful_pathkeys > 0)
{
! if (n_useful_pathkeys < list_length(root->sort_pathkeys))
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid
*** 6041,6048 ****
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL,
! seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
--- 6041,6049 ----
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL, 0,
! seqScanPath->startup_cost, seqScanPath->total_cost,
! rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index 1278371..2a894ae
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 634,639 ****
--- 634,640 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index c1be34d..88143d2
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2701,2706 ****
--- 2701,2707 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_Gather:
case T_GatherMerge:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index a1be858..f3f885f
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 973,979 ****
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 973,980 ----
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, 0,
! sorted_p.startup_cost, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 2d5caae..eff7ac1
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
}
/*
! * compare_path_fractional_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
--- 95,101 ----
}
/*
! * compare_fractional_path_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1297,1308 ****
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1297,1309 ----
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ int n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1316,1321 ****
--- 1317,1324 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->parent->tuples,
subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1552,1558 ****
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
--- 1555,1562 ----
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL, 0,
! subpath->startup_cost,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
*************** create_gather_merge_path(PlannerInfo *ro
*** 1644,1649 ****
--- 1648,1654 ----
GatherMergePath *pathnode = makeNode(GatherMergePath);
Cost input_startup_cost = 0;
Cost input_total_cost = 0;
+ int n_common_pathkeys;
Assert(subpath->parallel_safe);
Assert(pathkeys);
*************** create_gather_merge_path(PlannerInfo *ro
*** 1660,1666 ****
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1665,1673 ----
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
!
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_gather_merge_path(PlannerInfo *ro
*** 1674,1679 ****
--- 1681,1688 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2516,2524 ****
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode = makeNode(SortPath);
- pathnode->path.pathtype = T_Sort;
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
--- 2525,2555 ----
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode;
! int n_common_pathkeys;
!
! if (enable_incrementalsort)
! n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! else
! n_common_pathkeys = 0;
!
! if (n_common_pathkeys == 0)
! {
! pathnode = makeNode(SortPath);
! pathnode->path.pathtype = T_Sort;
! }
! else
! {
! IncrementalSortPath *incpathnode;
!
! incpathnode = makeNode(IncrementalSortPath);
! pathnode = &incpathnode->spath;
! pathnode->path.pathtype = T_IncrementalSort;
! incpathnode->skipCols = n_common_pathkeys;
! }
!
! Assert(n_common_pathkeys < list_length(pathkeys));
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2532,2538 ****
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root, pathkeys,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
--- 2563,2571 ----
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root,
! pathkeys, n_common_pathkeys,
! subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2840,2846 ****
else
{
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL,
0.0,
subpath->rows,
subpath->pathtarget->width,
--- 2873,2880 ----
else
{
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL, 0,
! 0.0,
0.0,
subpath->rows,
subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index 8502fcf..0af631a
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 277,283 ****
qstate->sortOperators,
qstate->sortCollations,
qstate->sortNullsFirsts,
! work_mem, false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
--- 277,283 ----
qstate->sortOperators,
qstate->sortCollations,
qstate->sortNullsFirsts,
! work_mem, false, false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index a35b93b..885bf43
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3568,3573 ****
--- 3568,3609 ----
}
/*
+ * estimate_pathkeys_groups - Estimate the number of groups the dataset is
+ * divided into by pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * the first i pathkeys divide the dataset into. This is a convenience
+ * wrapper over estimate_num_groups().
+ */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ ListCell *l;
+ List *groupExprs = NIL;
+ double *result;
+ int i;
+
+ /*
+ * Get number of groups for each prefix of pathkeys.
+ */
+ i = 0;
+ result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ groupExprs = lappend(groupExprs, member->em_expr);
+
+ result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ i++;
+ }
+
+ return result;
+ }
+
+ /*
* Estimate hash bucketsize fraction (ie, number of entries in a bucket
* divided by total tuples in relation) if the specified expression is used
* as a hash key.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index a414fb2..761c093
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 861,866 ****
--- 861,875 ----
NULL, NULL, NULL
},
{
+ {"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of incremental sort steps."),
+ NULL
+ },
+ &enable_incrementalsort,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of hashed aggregation plans."),
NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 5f62cd5..9822e27
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 282,287 ****
--- 282,294 ----
int64 allowedMem; /* total memory allowed, in bytes */
int maxTapes; /* number of tapes (Knuth's T) */
int tapeRange; /* maxTapes-1 (Knuth's P) */
+ int64 maxSpace; /* maximum amount of space occupied among sorts
of groups, either in-memory or on-disk */
+ bool maxSpaceOnDisk; /* true when maxSpace is the value for on-disk
space, false when it's the value for in-memory
space */
+ TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ MemoryContext maincontext;
MemoryContext sortcontext; /* memory context holding most sort data */
MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 636,641 ****
--- 643,651 ----
static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+
/*
* Special versions of qsort just for SortTuple objects. qsort_tuple() sorts
*************** static Tuplesortstate *
*** 670,688 ****
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Create a working memory context for this sort operation. All data
! * needed by the sort will live inside this context.
*/
! sortcontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
--- 680,709 ----
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
+ MemoryContext maincontext;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Memory context surviving tuplesort_reset. This memory context holds
! * data which is useful to keep while sorting multiple similar batches.
*/
! maincontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
+ * Create a working memory context for one sort operation. The content of
+ * this context is deleted by tuplesort_reset.
+ */
+ sortcontext = AllocSetContextCreate(maincontext,
+ "TupleSort sort",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+
+ /*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 699,705 ****
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(sortcontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
--- 720,726 ----
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(maincontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
*************** tuplesort_begin_common(int workMem, bool
*** 717,722 ****
--- 738,744 ----
state->availMem = state->allowedMem;
state->sortcontext = sortcontext;
state->tuplecontext = tuplecontext;
+ state->maincontext = maincontext;
state->tapeset = NULL;
state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 757,769 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
AssertArg(nkeys > 0);
--- 779,792 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
AssertArg(nkeys > 0);
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 805,811 ****
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0);
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
--- 828,834 ----
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0) && !skipAbbrev;
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 836,842 ****
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 859,865 ----
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 927,933 ****
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 950,956 ----
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 1002,1008 ****
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 1025,1031 ----
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 1044,1050 ****
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 1067,1073 ----
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1155,1170 ****
}
/*
! * tuplesort_end
! *
! * Release resources and clean up.
*
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
*/
! void
! tuplesort_end(Tuplesortstate *state)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1178,1189 ----
}
/*
! * tuplesort_free
*
! * Internal routine for freeing resources of tuplesort.
*/
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1223,1229 ****
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! MemoryContextDelete(state->sortcontext);
}
/*
--- 1242,1339 ----
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! if (delete)
! {
! MemoryContextDelete(state->maincontext);
! }
! else
! {
! MemoryContextResetOnly(state->sortcontext);
! MemoryContextResetOnly(state->tuplecontext);
! }
! }
!
! /*
! * tuplesort_end
! *
! * Release resources and clean up.
! *
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
! */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! tuplesort_free(state, true);
! }
!
! /*
! * tuplesort_updatemax
! *
! * Update maximum resource usage statistics.
! */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! int64 spaceUsed;
! bool spaceUsedOnDisk;
!
! /*
! * Note: it might seem we should provide both memory and disk usage for a
! * disk-based sort. However, the current code doesn't track memory space
! * accurately once we have begun to return tuples to the caller (since we
! * don't account for pfree's the caller is expected to do), so we cannot
! * rely on availMem in a disk sort. This does not seem worth the overhead
! * to fix. Is it worth creating an API for the memory context code to
! * tell us how much is actually used in sortcontext?
! */
! if (state->tapeset)
! {
! spaceUsedOnDisk = true;
! spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! }
! else
! {
! spaceUsedOnDisk = false;
! spaceUsed = state->allowedMem - state->availMem;
! }
!
! if (spaceUsed > state->maxSpace)
! {
! state->maxSpace = spaceUsed;
! state->maxSpaceOnDisk = spaceUsedOnDisk;
! state->maxSpaceStatus = state->status;
! }
! }
!
! /*
! * tuplesort_reset
! *
! * Reset the tuplesort. Reset all the data in the tuplesort, but leave the
! * meta-information in. After tuplesort_reset, the tuplesort is ready to start
! * a new sort. This allows us to avoid recreating the tuplesort (and thereby
! * save resources) when sorting multiple small batches.
! */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! tuplesort_updatemax(state);
! tuplesort_free(state, false);
! state->status = TSS_INITIAL;
! state->memtupcount = 0;
! state->boundUsed = false;
! state->tapeset = NULL;
! state->currentRun = 0;
! state->result_tape = -1;
! state->bounded = false;
! state->availMem = state->allowedMem;
! state->lastReturnedTuple = NULL;
! state->slabAllocatorUsed = false;
! state->slabMemoryBegin = NULL;
! state->slabMemoryEnd = NULL;
! state->slabFreeHead = NULL;
! USEMEM(state, GetMemoryChunkSpace(state->memtuples));
}
/*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 3235,3261 ****
const char **spaceType,
long *spaceUsed)
{
! /*
! * Note: it might seem we should provide both memory and disk usage for a
! * disk-based sort. However, the current code doesn't track memory space
! * accurately once we have begun to return tuples to the caller (since we
! * don't account for pfree's the caller is expected to do), so we cannot
! * rely on availMem in a disk sort. This does not seem worth the overhead
! * to fix. Is it worth creating an API for the memory context code to
! * tell us how much is actually used in sortcontext?
! */
! if (state->tapeset)
! {
*spaceType = "Disk";
- *spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- }
else
- {
*spaceType = "Memory";
! *spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! }
! switch (state->status)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
--- 3345,3359 ----
const char **spaceType,
long *spaceUsed)
{
! tuplesort_updatemax(state);
!
! if (state->maxSpaceOnDisk)
*spaceType = "Disk";
else
*spaceType = "Memory";
! *spaceUsed = (state->maxSpace + 1023) / 1024;
! switch (state->maxSpaceStatus)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...09c5a27
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,25 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+
+ #include "nodes/execnodes.h"
+
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node,
+ EState *estate, int eflags);
+ extern TupleTableSlot *ExecIncrementalSort(IncrementalSortState *node);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+ #endif /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index 4330a85..fd69c0f
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1680,1685 ****
--- 1680,1699 ----
Tuplestorestate *tuplestorestate;
} MaterialState;
+
+ /* ----------------
+ * When sorting by multiple keys, the input dataset could already be
+ * presorted by some prefix of these keys. We call these "skip keys".
+ * SkipKeyData represents information about one such key.
+ * ----------------
+ */
+ typedef struct SkipKeyData
+ {
+ FmgrInfo flinfo; /* comparison function info */
+ FunctionCallInfoData fcinfo; /* comparison function call info */
+ OffsetNumber attno; /* attribute number in tuple */
+ } SkipKeyData;
+
/* ----------------
* SortState information
* ----------------
*************** typedef struct SortState
*** 1696,1701 ****
--- 1710,1735 ----
void *tuplesortstate; /* private state of tuplesort.c */
} SortState;
+ /* ----------------
+ * IncrementalSortState information
+ * ----------------
+ */
+ typedef struct IncrementalSortState
+ {
+ ScanState ss; /* its first field is NodeTag */
+ bool bounded; /* is the result set bounded? */
+ int64 bound; /* if bounded, how many tuples are needed */
+ bool sort_Done; /* sort completed yet? */
+ bool finished; /* is fetching tuples from the outer
node finished? */
+ bool bounded_Done; /* value of bounded we did the sort with */
+ int64 bound_Done; /* value of bound we did the sort with */
+ void *tuplesortstate; /* private state of tuplesort.c */
+ SkipKeyData *skipKeys; /* keys the dataset is presorted by */
+ int64 groupsCount; /* number of groups with equal skip keys */
+ TupleTableSlot *prevSlot; /* slot for previous tuple from outer node */
+ } IncrementalSortState;
+
/* ---------------------
* GroupState information
* ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index f59d719..3e76ce3
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 73,78 ****
--- 73,79 ----
T_HashJoin,
T_Material,
T_Sort,
+ T_IncrementalSort,
T_Group,
T_Agg,
T_WindowAgg,
*************** typedef enum NodeTag
*** 125,130 ****
--- 126,132 ----
T_HashJoinState,
T_MaterialState,
T_SortState,
+ T_IncrementalSortState,
T_GroupState,
T_AggState,
T_WindowAggState,
*************** typedef enum NodeTag
*** 240,245 ****
--- 242,248 ----
T_ProjectionPath,
T_ProjectSetPath,
T_SortPath,
+ T_IncrementalSortPath,
T_GroupPath,
T_UpperUniquePath,
T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index cba9155..cfebbc5
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 740,745 ****
--- 740,756 ----
bool *nullsFirst; /* NULLS FIRST/LAST directions */
} Sort;
+
+ /* ----------------
+ * incremental sort node
+ * ----------------
+ */
+ typedef struct IncrementalSort
+ {
+ Sort sort;
+ int skipCols; /* number of presorted columns */
+ } IncrementalSort;
+
/* ---------------
* group node -
* Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index 7a8e2fd..9f5cc6f
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1418,1423 ****
--- 1418,1433 ----
} SortPath;
/*
+ * IncrementalSortPath
+ */
+ typedef struct IncrementalSortPath
+ {
+ SortPath spath;
+ int skipCols;
+ } IncrementalSortPath;
+
+
+ /*
* GroupPath represents grouping (of presorted input)
*
* groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index ed70def..47c26c4
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
extern bool enable_bitmapscan;
extern bool enable_tidscan;
extern bool enable_sort;
+ extern bool enable_incrementalsort;
extern bool enable_hashagg;
extern bool enable_nestloop;
extern bool enable_material;
*************** extern void cost_namedtuplestorescan(Pat
*** 102,109 ****
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
--- 103,111 ----
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 25fe78c..01073dd
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 182,187 ****
--- 182,188 ----
extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
Relids required_outer,
CostSelector cost_criterion,
*************** extern List *select_outer_pathkeys_for_m
*** 220,225 ****
--- 221,227 ----
extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
List *mergeclauses,
List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
extern List *truncate_useless_pathkeys(PlannerInfo *root,
RelOptInfo *rel,
List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 9f9d2dc..b8884b6
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 204,209 ****
--- 204,212 ----
extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows, List **pgset);
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ double tuples);
+
extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
double nbuckets);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 14b9026..4ea68e7
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 62,68 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
--- 62,69 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 106,111 ****
--- 107,114 ----
extern void tuplesort_end(Tuplesortstate *state);
+ extern void tuplesort_reset(Tuplesortstate *state);
+
extern void tuplesort_get_stats(Tuplesortstate *state,
const char **sortMethod,
const char **spaceType,
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort
*** 19,27 ****
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Sort
Sort Key: id, data
! -> Seq Scan on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
--- 19,28 ----
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Incremental Sort
Sort Key: id, data
! Presorted Key: id
! -> Index Scan using test_dc_pkey on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index 6163ed8..9553648
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE: drop cascades to table matest1
*** 1493,1498 ****
--- 1493,1499 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1633,1641 ****
--- 1634,1678 ----
{3,7,8,10,13,13,16,18,19,22}
(3 rows)
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ QUERY PLAN
+ -------------------------------------------------------------------------
+ Merge Append
+ Sort Key: tenk1.thousand, tenk1.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1
+ -> Incremental Sort
+ Sort Key: tenk1_1.thousand, tenk1_1.thousand
+ Presorted Key: tenk1_1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ QUERY PLAN
+ -------------------------------------------------------------
+ Merge Append
+ Sort Key: a.thousand, a.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 a
+ -> Incremental Sort
+ Sort Key: b.unique2, b.unique2
+ Presorted Key: b.unique2
+ -> Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index 568b783..e60fb43
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select count(*) >= 0 as ok from pg_prepa
*** 70,90 ****
-- This is to record the prevailing planner enable_foo settings during
-- a regression test run.
select name, setting from pg_settings where name like 'enable%';
! name | setting
! ----------------------+---------
! enable_bitmapscan | on
! enable_gathermerge | on
! enable_hashagg | on
! enable_hashjoin | on
! enable_indexonlyscan | on
! enable_indexscan | on
! enable_material | on
! enable_mergejoin | on
! enable_nestloop | on
! enable_seqscan | on
! enable_sort | on
! enable_tidscan | on
! (12 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
--- 70,91 ----
-- This is to record the prevailing planner enable_foo settings during
-- a regression test run.
select name, setting from pg_settings where name like 'enable%';
! name | setting
! ------------------------+---------
! enable_bitmapscan | on
! enable_gathermerge | on
! enable_hashagg | on
! enable_hashjoin | on
! enable_incrementalsort | on
! enable_indexonlyscan | on
! enable_indexscan | on
! enable_material | on
! enable_mergejoin | on
! enable_nestloop | on
! enable_seqscan | on
! enable_sort | on
! enable_tidscan | on
! (13 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index d43b75c..ec611f5
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 527,532 ****
--- 527,533 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
*************** SELECT
*** 588,596 ****
--- 589,614 ----
ORDER BY f.i LIMIT 10)
FROM generate_series(1, 3) g(i);
+ set enable_incrementalsort = on;
+
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
On Wed, Apr 26, 2017 at 8:39 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
That appears to be wrong. I intended to make cost_sort prefer plain sort
over incremental sort for this dataset size. But that appears not always
to be the right solution. Quicksort is so fast only on presorted data.
As you may know, I've often said that the precheck for sorted input
added to our quicksort implementation by a3f0b3d is misguided. It
sometimes throws away a ton of work if the presorted input isn't
*perfectly* presorted. This happens when the first out of order tuple
is towards the end of the presorted input.
I think that it isn't fair to credit our qsort with doing so well on a
100% presorted case, because it doesn't do the necessary bookkeeping
to not throw that work away completely in certain important cases.
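The behavior Peter describes can be sketched roughly as follows (illustrative Python, not PostgreSQL's actual C implementation): the precheck scans for sortedness, and as soon as one out-of-order element turns up, every comparison spent on the scan is discarded.

```python
# A minimal sketch of a "presort precheck" before a full sort.  The function
# name and the wasted-work counter are illustrative only.

def sort_with_presort_check(items):
    """Return (sorted copy, number of precheck comparisons thrown away)."""
    for i in range(1, len(items)):
        if items[i - 1] > items[i]:
            # Precheck failed: all i comparisons so far are discarded and
            # a full sort runs from scratch.
            return sorted(items), i
    return list(items), 0  # perfectly presorted: no sort needed

# 99% presorted input: one out-of-order element near the end means almost
# the whole scan is wasted work.
data = list(range(100))
data[98], data[99] = data[99], data[98]
result, wasted = sort_with_presort_check(data)
```

With the swap near the end, the precheck performs 99 comparisons before failing, all of which are thrown away, which is exactly the "high risk" shape of this optimization.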
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Apr 26, 2017 at 7:56 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Apr 26, 2017 at 8:39 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
That appears to be wrong. I intended to make cost_sort prefer plain sort
over incremental sort for this dataset size. But that appears not always
to be the right solution. Quicksort is so fast only on presorted data.
As you may know, I've often said that the precheck for sorted input
added to our quicksort implementation by a3f0b3d is misguided. It
sometimes throws away a ton of work if the presorted input isn't
*perfectly* presorted. This happens when the first out-of-order tuple
is towards the end of the presorted input.
I think that it isn't fair to credit our qsort with doing so well on a
100% presorted case, because it doesn't do the necessary bookkeeping
to not throw that work away completely in certain important cases.
OK, I get it. Our qsort is fast not only in the 100% presorted case.
However, that doesn't change much in the context of incremental sort.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Apr 26, 2017 at 10:10 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
OK, I get it. Our qsort is fast not only in the 100% presorted case.
However, that doesn't change much in the context of incremental sort.
The important point is to make any presorted test case only ~99%
presorted, so as to not give too much credit to the "high risk"
presort check optimization.
The switch to insertion sort that we left in (not the bad one removed
by a3f0b3d -- the insertion sort that actually comes from the B&M
paper) does "legitimately" make sorting faster with presorted cases.
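The B&M-style insertion sort Peter mentions really is cheap on presorted runs, which can be seen in a small sketch (illustrative Python with a comparison counter; not the actual qsort code):

```python
# Insertion sort: on already-sorted input each element needs exactly one
# comparison, so a fully presorted run of n elements costs n-1 comparisons
# rather than O(n log n) -- the "legitimate" speedup on presorted cases.

def insertion_sort(items):
    items = list(items)
    comparisons = 0
    for i in range(1, len(items)):
        j = i
        while j > 0:
            comparisons += 1
            if items[j - 1] <= items[j]:
                break               # already in place: one comparison only
            items[j - 1], items[j] = items[j], items[j - 1]
            j -= 1
    return items, comparisons
```

Unlike the presort precheck, none of these comparisons are thrown away: they do the sorting work directly.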
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/
On Wed, Apr 26, 2017 at 8:20 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Apr 26, 2017 at 10:10 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
OK, I get it. Our qsort is fast not only in the 100% presorted case.
However, that doesn't change much in the context of incremental sort.
The important point is to make any presorted test case only ~99%
presorted, so as to not give too much credit to the "high risk"
presort check optimization.
The switch to insertion sort that we left in (not the bad one removed
by a3f0b3d -- the insertion sort that actually comes from the B&M
paper) does "legitimately" make sorting faster with presorted cases.
I'm still focusing on making incremental sort not slower than qsort
with the presorted optimization, regardless of whether that is a "high
risk" optimization or not...
However, adding more test cases is always good.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Apr 26, 2017 at 11:39 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
But I'd like to make incremental sort not slower than quicksort in the case
of presorted data. A new idea came to my mind: since the cause of
incremental sort's slowness in this case is too-frequent resets of the
tuplesort, what if we artificially put the data into larger groups? The
attached revision of the patch implements this: it doesn't stop
accumulating tuples into the tuplesort until we have MIN_GROUP_SIZE tuples.
Now, incremental sort is not slower than quicksort. And this seems to be
cool.
However, in the LIMIT case we will pay the price of fetching some extra
tuples from the outer node. But that doesn't seem to hurt us too much.
Any thoughts?
Nice idea.
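The grouping trick in the quoted message can be sketched roughly like this (illustrative Python; MIN_GROUP_SIZE comes from the patch, everything else here is assumed, including the deliberately small threshold):

```python
# Sketch of the MIN_GROUP_SIZE idea: instead of sorting each presorted-prefix
# group separately (one tuplesort reset per group), keep accumulating rows
# until the batch holds at least MIN_GROUP_SIZE of them.  Because the input
# is sorted on the prefix, sorting a merged batch on the full key still
# yields globally correct output.

MIN_GROUP_SIZE = 4   # small value for illustration; the patch uses a larger one

def incremental_sort(rows, prefix_len):
    """Yield rows sorted on the full key, given rows arrive sorted on the
    first prefix_len key columns."""
    batch = []
    prev_prefix = None
    for row in rows:
        prefix = row[:prefix_len]
        # Only flush at a group boundary, and only once the batch is large
        # enough to amortize the cost of restarting the sort.
        if batch and prefix != prev_prefix and len(batch) >= MIN_GROUP_SIZE:
            yield from sorted(batch)
            batch = []
        batch.append(row)
        prev_prefix = prefix
    yield from sorted(batch)

rows = [(1, 5), (1, 2), (2, 10), (2, 1), (2, 5), (3, 3), (3, 7)]
out = list(incremental_sort(rows, 1))
```

Note the trade-off mentioned in the message: in the LIMIT case the batching forces fetching a few extra rows from the outer node before the first batch can be emitted.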
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Apr 27, 2017 at 5:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Apr 26, 2017 at 11:39 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
But I'd like to make incremental sort not slower than quicksort in the case
of presorted data. A new idea came to my mind: since the cause of
incremental sort's slowness in this case is too-frequent resets of the
tuplesort, what if we artificially put the data into larger groups? The
attached revision of the patch implements this: it doesn't stop
accumulating tuples into the tuplesort until we have MIN_GROUP_SIZE tuples.
Now, incremental sort is not slower than quicksort. And this seems to be
cool.
However, in the LIMIT case we will pay the price of fetching some extra
tuples from the outer node. But that doesn't seem to hurt us too much.
Any thoughts?
Nice idea.
Cool.
Then I'm going to make a set of synthetic performance tests in order to
ensure that there is no regression.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Apr 27, 2017 at 5:23 PM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:
On Thu, Apr 27, 2017 at 5:06 PM, Robert Haas <robertmhaas@gmail.com>
wrote:
On Wed, Apr 26, 2017 at 11:39 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
But I'd like to make incremental sort not slower than quicksort in the case
of presorted data. A new idea came to my mind: since the cause of
incremental sort's slowness in this case is too-frequent resets of the
tuplesort, what if we artificially put the data into larger groups? The
attached revision of the patch implements this: it doesn't stop
accumulating tuples into the tuplesort until we have MIN_GROUP_SIZE tuples.
Now, incremental sort is not slower than quicksort. And this seems to be
cool.
However, in the LIMIT case we will pay the price of fetching some extra
tuples from the outer node. But that doesn't seem to hurt us too much.
Any thoughts?
Nice idea.
Cool.
Then I'm going to make a set of synthetic performance tests in order to
ensure that there is no regression.
The next revision of the patch is attached.
This revision contains one important optimization. I found that it's not
necessary to make every tuple go through the prevTuple slot. It's enough
to save a single sample tuple per sort group and compare the skip columns
against it. This optimization avoids the regression on large sort groups
which I had observed.
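The sample-tuple optimization can be sketched like so (illustrative Python; the function name is mine, not the patch's): only one representative tuple per group is kept, and each incoming tuple's skip columns are compared against it.

```python
# Sketch of the "one sample tuple per group" idea: rather than routing every
# tuple through a previous-tuple slot, save a single sample per sort group
# and compare each incoming tuple's skip (presorted) columns against it to
# detect group boundaries.

def group_boundaries(rows, skip_cols):
    """Yield True for each row that starts a new sort group."""
    sample = None
    for row in rows:
        key = row[:skip_cols]
        if sample is None or key != sample:
            sample = key          # save one sample for the new group
            yield True            # group boundary: reset the tuplesort here
        else:
            yield False           # same group: no comparison with any
                                  # earlier tuple other than the sample
```

On a large group this performs one cheap comparison per tuple against a fixed sample, instead of maintaining a rolling previous tuple, which is where the regression was avoided.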
I'm also attaching the python script (incsort_test.py) which I use for
synthetic performance benchmarking. This script runs benchmarks similar
to the one posted by Heikki, but with some variations. These benchmarks
are aimed at checking whether there are cases where incremental sort is
slower than plain sort.
The script generates tables with the structure described in the 'tables'
array. The md5 function is used to generate text values. For the first
GroupedCols table columns, groups of GroupSize equal values are
generated. Then there are columns whose values are simply sequential.
The last column has a PreorderedFrac fraction of sequential values,
while the rest of its values are random. Therefore, we can measure the
influence of the presorted optimization in qsort with various fractions
of presorted data. There is also a btree index covering all the columns
of the table.
The benchmark query selects the contents of the generated table ordered
by the grouped columns and by the last column. An index-only scan
outputs tuples ordered by the grouped columns, and incremental sort has
to perform sorting only inside those groups. The plain sort case is
forced to use index-only scans as well, so that we compare sort methods
rather than scan methods.
Results are also attached (results.csv). The last column contains the
difference between incremental and plain sort time in percent. A
negative value means that incremental sort is faster in that case.
Incremental sort is faster in the vast majority of cases. It appears to
be slower only when the whole dataset is a single sort group. In that
case incremental sort is useless, and choosing it should be considered a
misuse of incremental sort. The slowdown comes from the fact that we
have to do extra comparisons anyway, unless we somehow push our
comparison result into qsort itself and save some CPU cycles (but that
would be an unreasonable break of encapsulation). Thus, in such cases
some regression seems inevitable. However, I think we could avoid this
regression during query planning: if we see that there would be only a
few groups, we should choose plain sort instead of incremental sort.
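The planning-time escape hatch suggested above might look roughly like this (a toy sketch with an assumed threshold and function names; not the patch's cost model):

```python
# Toy model of the proposed planning heuristic: if the estimated number of
# presorted groups is small, the per-group overhead of incremental sort
# cannot pay off, so the planner should fall back to a plain sort.

MIN_GROUPS_FOR_INCREMENTAL = 4   # hypothetical threshold, not from the patch

def choose_sort(estimated_rows, estimated_groups):
    """Pick a sort strategy from planner estimates (illustrative only)."""
    if estimated_groups < MIN_GROUPS_FOR_INCREMENTAL:
        return "plain sort"
    return "incremental sort"
```

In the degenerate single-group case this picks plain sort, which is exactly the case where the benchmark showed incremental sort losing.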
Any thoughts?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
incremental-sort-7.patchapplication/octet-stream; name=incremental-sort-7.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index d1bc5b0..c9de7ea
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1943,1981 ****
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ---------------------------------------------------------------------
Limit
! Output: t1.c1, t2.c1
-> Sort
! Output: t1.c1, t2.c1
! Sort Key: t1.c1, t2.c1
-> Nested Loop
! Output: t1.c1, t2.c1
-> Foreign Scan on public.ft1 t1
! Output: t1.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-> Materialize
! Output: t2.c1
-> Foreign Scan on public.ft2 t2
! Output: t2.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! c1 | c1
! ----+-----
! 1 | 101
! 1 | 102
! 1 | 103
! 1 | 104
! 1 | 105
! 1 | 106
! 1 | 107
! 1 | 108
! 1 | 109
! 1 | 110
(10 rows)
-- different server, not pushed down. No result expected.
--- 1943,1981 ----
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ------------------------------------------------------------------
Limit
! Output: t1.c3, t2.c3
-> Sort
! Output: t1.c3, t2.c3
! Sort Key: t1.c3, t2.c3
-> Nested Loop
! Output: t1.c3, t2.c3
-> Foreign Scan on public.ft1 t1
! Output: t1.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
-> Materialize
! Output: t2.c3
-> Foreign Scan on public.ft2 t2
! Output: t2.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! c3 | c3
! -------+-------
! 00001 | 00101
! 00001 | 00102
! 00001 | 00103
! 00001 | 00104
! 00001 | 00105
! 00001 | 00106
! 00001 | 00107
! 00001 | 00108
! 00001 | 00109
! 00001 | 00110
(10 rows)
-- different server, not pushed down. No result expected.
*************** select c2/2, sum(c2) * (c2/2) from ft1 g
*** 2517,2534 ****
-- Aggregates in subquery are pushed down.
explain (verbose, costs off)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
! QUERY PLAN
! ----------------------------------------------------------------------------------------------------------
Aggregate
Output: count(ft1.c2), sum(ft1.c2)
! -> Sort
Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
Sort Key: ft1.c2, (sum(ft1.c1))
! -> Foreign Scan
! Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
! Relations: Aggregate on (public.ft1)
! Remote SQL: SELECT c2, sum("C 1"), sqrt("C 1") FROM "S 1"."T 1" GROUP BY c2, (sqrt("C 1"))
! (9 rows)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
count | sum
--- 2517,2537 ----
-- Aggregates in subquery are pushed down.
explain (verbose, costs off)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
! QUERY PLAN
! --------------------------------------------------------------------------------------------------------------------------
Aggregate
Output: count(ft1.c2), sum(ft1.c2)
! -> Incremental Sort
Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
Sort Key: ft1.c2, (sum(ft1.c1))
! Presorted Key: ft1.c2
! -> GroupAggregate
! Output: ft1.c2, sum(ft1.c1), (sqrt((ft1.c1)::double precision))
! Group Key: ft1.c2, sqrt((ft1.c1)::double precision)
! -> Foreign Scan on public.ft1
! Output: ft1.c2, sqrt((ft1.c1)::double precision), ft1.c1
! Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" ORDER BY c2 ASC NULLS LAST, sqrt("C 1") ASC NULLS LAST
! (12 rows)
select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
count | sum
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index 509bb54..263a646
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 487,494 ****
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 487,494 ----
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index 0b9e300..84a26d9
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3563,3568 ****
--- 3563,3582 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+ <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of incremental sort
+ steps. The default is <literal>on</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
<term><varname>enable_indexscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 9359d0a..52987bb
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual,
*** 80,85 ****
--- 80,87 ----
ExplainState *es);
static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 93,99 ****
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 95,101 ----
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 101,106 ****
--- 103,110 ----
static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 1003,1008 ****
--- 1007,1015 ----
case T_Sort:
pname = sname = "Sort";
break;
+ case T_IncrementalSort:
+ pname = sname = "Incremental Sort";
+ break;
case T_Group:
pname = sname = "Group";
break;
*************** ExplainNode(PlanState *planstate, List *
*** 1593,1598 ****
--- 1600,1611 ----
show_sort_keys(castNode(SortState, planstate), ancestors, es);
show_sort_info(castNode(SortState, planstate), es);
break;
+ case T_IncrementalSort:
+ show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ ancestors, es);
+ show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ es);
+ break;
case T_MergeAppend:
show_merge_append_keys(castNode(MergeAppendState, planstate),
ancestors, es);
*************** static void
*** 1918,1932 ****
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
* Likewise, for a MergeAppend node.
*/
static void
--- 1931,1968 ----
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
+ int skipCols;
+
+ if (IsA(plan, IncrementalSort))
+ skipCols = ((IncrementalSort *) plan)->skipCols;
+ else
+ skipCols = 0;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, skipCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es)
+ {
+ IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+ show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ plan->sort.numCols, plan->skipCols,
+ plan->sort.sortColIdx,
+ plan->sort.sortOperators, plan->sort.collations,
+ plan->sort.nullsFirst,
+ ancestors, es);
+ }
+
+ /*
* Likewise, for a MergeAppend node.
*/
static void
*************** show_merge_append_keys(MergeAppendState
*** 1936,1942 ****
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
--- 1972,1978 ----
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, 0, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1960,1966 ****
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
--- 1996,2002 ----
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
*************** show_grouping_set_keys(PlanState *planst
*** 2029,2035 ****
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
--- 2065,2071 ----
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, 0, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 2086,2092 ****
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
--- 2122,2128 ----
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2099,2111 ****
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
--- 2135,2148 ----
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
+ List *resultPresorted = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2145,2153 ****
--- 2182,2194 ----
nullsFirst[keyno]);
/* Emit one property-list item per sort key */
result = lappend(result, pstrdup(sortkeybuf.data));
+ if (keyno < nPresortedKeys)
+ resultPresorted = lappend(resultPresorted, exprstr);
}
ExplainPropertyList(qlabel, result, es);
+ if (nPresortedKeys > 0)
+ ExplainPropertyList("Presorted Key", resultPresorted, es);
}
/*
*************** show_sort_info(SortState *sortstate, Exp
*** 2305,2310 ****
--- 2346,2388 ----
}
/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es)
+ {
+ if (es->analyze && incrsortstate->sort_Done &&
+ incrsortstate->tuplesortstate != NULL)
+ {
+ Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+
+ tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",
+ sortMethod, spaceType, spaceUsed);
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort groups: %ld\n",
+ incrsortstate->groupsCount);
+ }
+ else
+ {
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups",
+ incrsortstate->groupsCount, es);
+ }
+ }
+ }
+
+ /*
* Show information on hash buckets/batches.
*/
static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index 083b20f..b093618
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execExpr.
*** 24,31 ****
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
! nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
--- 24,31 ----
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
! nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index 7e85c66..e7fd9f9
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 30,35 ****
--- 30,36 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 248,253 ****
--- 249,258 ----
ExecReScanSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecReScanIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecReScanGroup((GroupState *) node);
break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 519,526 ****
--- 524,535 ----
case T_CteScan:
case T_Material:
case T_Sort:
+ /* these don't evaluate tlist */
return true;
+ case T_IncrementalSort:
+ return false;
+
case T_LockRows:
case T_Limit:
return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index 486ddf1..2f4a23a
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 93,98 ****
--- 93,99 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 308,313 ****
--- 309,319 ----
estate, eflags);
break;
+ case T_IncrementalSort:
+ result = (PlanState *) ExecInitIncrementalSort(
+ (IncrementalSort *) node, estate, eflags);
+ break;
+
case T_Group:
result = (PlanState *) ExecInitGroup((Group *) node,
estate, eflags);
*************** ExecProcNode(PlanState *node)
*** 531,536 ****
--- 537,546 ----
result = ExecSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ result = ExecIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
result = ExecGroup((GroupState *) node);
break;
*************** ExecEndNode(PlanState *node)
*** 803,808 ****
--- 813,822 ----
ExecEndSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecEndIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecEndGroup((GroupState *) node);
break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index c2b8618..551664c
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 655,660 ****
--- 655,661 ----
sortnode->collations,
sortnode->nullsFirst,
work_mem,
+ false,
false);
}
*************** initialize_aggregate(AggState *aggstate,
*** 736,742 ****
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false);
}
/*
--- 737,743 ----
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false, false);
}
/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...79ae888
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,527 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ * Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ * Incremental sort is a specially optimized kind of multikey sort used
+ * when the input is already presorted by a prefix of the required list
+ * of keys. Thus, when it's required to sort by (key1, key2 ... keyN)
+ * and the input is already sorted by (key1, key2 ... keyM), M < N, we
+ * individually sort the groups in which the values of (key1, key2 ...
+ * keyM) are equal.
+ *
+ * Consider the following example. We have input tuples consisting of
+ * two integers (x, y), already presorted by x, while it's required to
+ * sort them by both x and y. Let the input tuples be the following.
+ *
+ * (1, 5)
+ * (1, 2)
+ * (2, 10)
+ * (2, 1)
+ * (2, 5)
+ * (3, 3)
+ * (3, 7)
+ *
+ * The incremental sort algorithm would sort the following groups,
+ * which have equal x, individually:
+ * (1, 5) (1, 2)
+ * (2, 10) (2, 1) (2, 5)
+ * (3, 3) (3, 7)
+ *
+ * After sorting these groups and putting them together, we would get
+ * the following tuple set, which is actually sorted by x and y.
+ *
+ * (1, 2)
+ * (1, 5)
+ * (2, 1)
+ * (2, 5)
+ * (2, 10)
+ * (3, 3)
+ * (3, 7)
+ *
+ * Incremental sort is faster than a full sort on large datasets. But
+ * the biggest benefit of incremental sort is in queries with LIMIT,
+ * because incremental sort can return the first tuples without reading
+ * the whole input dataset.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+
+ /*
+ * Check if first "skipCols" sort values are equal.
+ */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ TupleTableSlot *b)
+ {
+ int n, i;
+
+ Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+ n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+ for (i = 0; i < n; i++)
+ {
+ Datum datumA, datumB, result;
+ bool isnullA, isnullB;
+ AttrNumber attno = node->skipKeys[i].attno;
+ SkipKeyData *key;
+
+ datumA = slot_getattr(a, attno, &isnullA);
+ datumB = slot_getattr(b, attno, &isnullB);
+
+ /* Special case for NULL-vs-NULL, else use standard comparison */
+ if (isnullA || isnullB)
+ {
+ if (isnullA == isnullB)
+ continue;
+ else
+ return false;
+ }
+
+ key = &node->skipKeys[i];
+
+ key->fcinfo.arg[0] = datumA;
+ key->fcinfo.arg[1] = datumB;
+
+ /* just for paranoia's sake, we reset isnull each time */
+ key->fcinfo.isnull = false;
+
+ result = FunctionCallInvoke(&key->fcinfo);
+
+ /* Check for null result, since caller is clearly not expecting one */
+ if (key->fcinfo.isnull)
+ elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+ if (!DatumGetBool(result))
+ return false;
+ }
+ return true;
+ }
+
+ /*
+ * Prepare information for skipKeys comparison.
+ */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ int skipCols,
+ i;
+
+ Assert(IsA(plannode, IncrementalSort));
+ skipCols = plannode->skipCols;
+
+ node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+ for (i = 0; i < skipCols; i++)
+ {
+ Oid equalityOp, equalityFunc;
+ SkipKeyData *key;
+
+ key = &node->skipKeys[i];
+ key->attno = plannode->sort.sortColIdx[i];
+
+ equalityOp = get_equality_op_for_ordering_op(
+ plannode->sort.sortOperators[i], NULL);
+ if (!OidIsValid(equalityOp))
+ elog(ERROR, "missing equality operator for ordering operator %u",
+ plannode->sort.sortOperators[i]);
+
+ equalityFunc = get_opcode(equalityOp);
+ if (!OidIsValid(equalityFunc))
+ elog(ERROR, "missing function for operator %u", equalityOp);
+
+ /* Lookup the comparison function */
+ fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+ /* We can initialize the callinfo just once and re-use it */
+ InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ plannode->sort.collations[i], NULL, NULL);
+ key->fcinfo.argnull[0] = false;
+ key->fcinfo.argnull[1] = false;
+ }
+ }
+
+
+ #define MIN_GROUP_SIZE 32
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSort
+ *
+ * Assuming that the outer subtree returns tuples presorted by some
+ * prefix of the target sort columns, performs an incremental sort.
+ * It fetches groups of tuples where the prefix sort columns are equal
+ * and sorts them using tuplesort. This approach avoids sorting the
+ * whole dataset at once. Besides taking less memory and being faster,
+ * it allows us to start returning tuples before fetching the full
+ * dataset from the outer subtree.
+ *
+ * Conditions:
+ * -- none.
+ *
+ * Initial States:
+ * -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+ TupleTableSlot *
+ ExecIncrementalSort(IncrementalSortState *node)
+ {
+ EState *estate;
+ ScanDirection dir;
+ Tuplesortstate *tuplesortstate;
+ TupleTableSlot *slot;
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ PlanState *outerNode;
+ TupleDesc tupDesc;
+ int64 nTuples = 0;
+
+ /*
+ * get state info from node
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "entering routine");
+
+ estate = node->ss.ps.state;
+ dir = estate->es_direction;
+ tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+ /*
+ * Return next tuple from sorted set if any.
+ */
+ if (node->sort_Done)
+ {
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ if (tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL) || node->finished)
+ return slot;
+ }
+
+ /*
+ * At this point the current sorted group (if any) is exhausted. Read
+ * the next group of tuples from the outer plan and pass them to
+ * tuplesort.
+ */
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "sorting subplan");
+
+ /*
+ * Want to scan subplan in the forward direction while creating the
+ * sorted data.
+ */
+ estate->es_direction = ForwardScanDirection;
+
+ /*
+ * Initialize tuplesort module.
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "calling tuplesort_begin");
+
+ outerNode = outerPlanState(node);
+ tupDesc = ExecGetResultType(outerNode);
+
+ if (node->tuplesortstate == NULL)
+ {
+ /*
+ * We are going to process the first group of presorted data.
+ * Initialize support structures for cmpSortSkipCols - already
+ * sorted columns.
+ */
+ prepareSkipCols(node);
+
+ /*
+ * Pass all the columns to tuplesort. We pass groups of at least
+ * MIN_GROUP_SIZE tuples to tuplesort, so these groups don't
+ * necessarily have equal values of the first columns. We are unlikely
+ * to have huge groups with incremental sort, so using abbreviated
+ * keys would likely be a waste of time.
+ */
+ tuplesortstate = tuplesort_begin_heap(
+ tupDesc,
+ plannode->sort.numCols,
+ plannode->sort.sortColIdx,
+ plannode->sort.sortOperators,
+ plannode->sort.collations,
+ plannode->sort.nullsFirst,
+ work_mem,
+ false,
+ true);
+ node->tuplesortstate = (void *) tuplesortstate;
+ node->groupsCount++;
+ }
+ else
+ {
+ /* Next group of presorted data */
+ tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ node->groupsCount++;
+ }
+
+ /* Calculate remaining bound for bounded sort */
+ if (node->bounded)
+ tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+ /* Put the saved tuple, if any, into the tuplesort first */
+ if (!TupIsNull(node->sampleSlot))
+ {
+ tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+ ExecClearTuple(node->sampleSlot);
+ nTuples++;
+ }
+
+ /*
+ * Feed the tuplesort with the next group of tuples, i.e. tuples whose
+ * skipCols sort key values are all equal.
+ */
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+
+ /* While the minimal group is not yet full, put every tuple into it */
+ if (nTuples < MIN_GROUP_SIZE)
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+
+ /* Save last tuple in minimal group */
+ if (nTuples == MIN_GROUP_SIZE - 1)
+ ExecCopySlot(node->sampleSlot, slot);
+ nTuples++;
+ }
+ else
+ {
+ /* Iterate while the skip columns match those of the saved tuple */
+ bool cmp;
+ cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+
+ if (cmp)
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ nTuples++;
+ }
+ else
+ {
+ ExecCopySlot(node->sampleSlot, slot);
+ break;
+ }
+ }
+ }
+
+ /*
+ * Complete the sort.
+ */
+ tuplesort_performsort(tuplesortstate);
+
+ /*
+ * restore to user specified direction
+ */
+ estate->es_direction = dir;
+
+ /*
+ * finally set the sorted flag to true
+ */
+ node->sort_Done = true;
+ node->bounded_Done = node->bounded;
+
+ /*
+ * Adjust bound_Done with the number of tuples we've actually sorted.
+ */
+ if (node->bounded)
+ {
+ if (node->finished)
+ node->bound_Done = node->bound;
+ else
+ node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ }
+
+ SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "retrieving tuple from tuplesort");
+
+ /*
+ * Get the first or next tuple from tuplesort. Returns NULL if no more
+ * tuples.
+ */
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+ return slot;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecInitIncrementalSort
+ *
+ * Creates the run-time state information for the sort node
+ * produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ IncrementalSortState *incrsortstate;
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "initializing sort node");
+
+ /*
+ * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only hold the
+ * current group in tuplesortstate.
+ */
+ Assert((eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) == 0);
+
+ /*
+ * create state structure
+ */
+ incrsortstate = makeNode(IncrementalSortState);
+ incrsortstate->ss.ps.plan = (Plan *) node;
+ incrsortstate->ss.ps.state = estate;
+
+ incrsortstate->bounded = false;
+ incrsortstate->sort_Done = false;
+ incrsortstate->finished = false;
+ incrsortstate->tuplesortstate = NULL;
+ incrsortstate->sampleSlot = NULL;
+ incrsortstate->bound_Done = 0;
+ incrsortstate->groupsCount = 0;
+ incrsortstate->skipKeys = NULL;
+
+ /*
+ * Miscellaneous initialization
+ *
+ * Sort nodes don't initialize their ExprContexts because they never call
+ * ExecQual or ExecProject.
+ */
+
+ /*
+ * tuple table initialization
+ *
+ * sort nodes only return scan tuples from their sorted relation.
+ */
+ ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+
+ /*
+ * initialize child nodes
+ *
+ * We shield the child node from the need to support REWIND, BACKWARD, or
+ * MARK/RESTORE.
+ */
+ eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+ outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+ /*
+ * initialize tuple type. no need to initialize projection info because
+ * this node doesn't do projections.
+ */
+ ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+ /* make standalone slot to store previous tuple from outer node */
+ incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+ ExecGetResultType(outerPlanState(incrsortstate)));
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "sort node initialized");
+
+ return incrsortstate;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "shutting down sort node");
+
+ /*
+ * clean out the tuple table
+ */
+ ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ /* must drop the standalone tuple slot for tuples from the outer node */
+ ExecDropSingleTupleTableSlot(node->sampleSlot);
+
+ /*
+ * Release tuplesort resources
+ */
+ if (node->tuplesortstate != NULL)
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+
+ /*
+ * shut down the subplan
+ */
+ ExecEndNode(outerPlanState(node));
+
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "sort node shutdown");
+ }
+
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ PlanState *outerPlan = outerPlanState(node);
+
+ /*
+ * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ * re-scan it at all.
+ */
+ if (!node->sort_Done)
+ return;
+
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+ /*
+ * We must forget previous sort results and re-read the subplan in order
+ * to re-sort. Incremental sort keeps no random-access sorted output, so
+ * there is nothing we could simply rewind and rescan.
+ */
+ node->sort_Done = false;
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+ node->bound_Done = 0;
+
+ /*
+ * if chgParam of subnode is not null then plan will be re-scanned by
+ * first ExecProcNode.
+ */
+ if (outerPlan->chgParam == NULL)
+ ExecReScan(outerPlan);
+ }
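As a side note for reviewers: the per-group strategy that ExecIncrementalSort implements can be demonstrated with a tiny standalone program. This is an illustrative sketch only; plain qsort() stands in for tuplesort, the Tuple struct and function names are hypothetical, and the MIN_GROUP_SIZE batching is omitted:

```c
#include <assert.h>
#include <stdlib.h>

typedef struct
{
	int			x;
	int			y;
} Tuple;

/* Comparator for the remaining sort key y. */
static int
cmp_y(const void *a, const void *b)
{
	return ((const Tuple *) a)->y - ((const Tuple *) b)->y;
}

/* Sort in place by (x, y), assuming the input is already sorted by x. */
static void
incremental_sort(Tuple *tuples, int n)
{
	int			start = 0;

	while (start < n)
	{
		int			end = start + 1;

		/* Find the end of the run of tuples with equal x. */
		while (end < n && tuples[end].x == tuples[start].x)
			end++;

		/* Sort just this group by the remaining key y. */
		qsort(tuples + start, end - start, sizeof(Tuple), cmp_y);
		start = end;
	}
}
```

Running this over the (x, y) example from the header comment produces the fully sorted sequence shown there.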
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 924b458..1809e5d
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(SortState *node)
*** 89,95 ****
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
--- 89,96 ----
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess,
! false);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index 35a237a..2c2e17d
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 915,920 ****
--- 915,938 ----
/*
+ * CopySortFields
+ *
+ * This function copies the fields of the Sort node. It is used by
+ * all the copy functions for classes which inherit from Sort.
+ */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+ COPY_SCALAR_FIELD(numCols);
+ COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+
+ /*
* _copySort
*/
static Sort *
*************** _copySort(const Sort *from)
*** 925,937 ****
/*
* copy node superclass fields
*/
! CopyPlanFields((const Plan *) from, (Plan *) newnode);
! COPY_SCALAR_FIELD(numCols);
! COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
return newnode;
}
--- 943,971 ----
/*
* copy node superclass fields
*/
! CopySortFields(from, newnode);
! return newnode;
! }
!
!
! /*
! * _copyIncrementalSort
! */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! IncrementalSort *newnode = makeNode(IncrementalSort);
!
! /*
! * copy node superclass fields
! */
! CopySortFields((const Sort *) from, (Sort *) newnode);
!
! /*
! * copy remainder of node
! */
! COPY_SCALAR_FIELD(skipCols);
return newnode;
}
*************** copyObjectImpl(const void *from)
*** 4784,4789 ****
--- 4818,4826 ----
case T_Sort:
retval = _copySort(from);
break;
+ case T_IncrementalSort:
+ retval = _copyIncrementalSort(from);
+ break;
case T_Group:
retval = _copyGroup(from);
break;
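The CopySortFields refactoring relies on C struct embedding: IncrementalSort's first member is a Sort, so one helper can fill in the shared fields for both node types. A minimal sketch of the pattern (simplified, hypothetical field set, not the actual node layouts):

```c
#include <assert.h>

/* Simplified node layouts: the real structs carry many more fields. */
typedef struct Sort
{
	int			numCols;
} Sort;

typedef struct IncrementalSort
{
	Sort		sort;			/* "superclass" embedded as first member */
	int			skipCols;
} IncrementalSort;

/* Shared helper, analogous to CopySortFields(). */
static void
copy_sort_fields(const Sort *from, Sort *newnode)
{
	newnode->numCols = from->numCols;
}

/* Analogous to _copyIncrementalSort(): superclass fields, then its own. */
static void
copy_incremental_sort(const IncrementalSort *from, IncrementalSort *newnode)
{
	copy_sort_fields(&from->sort, &newnode->sort);
	newnode->skipCols = from->skipCols;
}
```

The same embedding is what makes the `(const Sort *) from` cast in _copyIncrementalSort safe.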
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index 98f6768..6944701
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 841,852 ****
}
static void
! _outSort(StringInfo str, const Sort *node)
{
int i;
- WRITE_NODE_TYPE("SORT");
-
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
--- 841,850 ----
}
static void
! _outSortInfo(StringInfo str, const Sort *node)
{
int i;
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 869,874 ****
--- 867,890 ----
}
static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ WRITE_NODE_TYPE("SORT");
+
+ _outSortInfo(str, node);
+ }
+
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ WRITE_NODE_TYPE("INCREMENTALSORT");
+
+ _outSortInfo(str, (const Sort *) node);
+
+ WRITE_INT_FIELD(skipCols);
+ }
+
+ static void
_outUnique(StringInfo str, const Unique *node)
{
int i;
*************** outNode(StringInfo str, const void *obj)
*** 3697,3702 ****
--- 3713,3721 ----
case T_Sort:
_outSort(str, obj);
break;
+ case T_IncrementalSort:
+ _outIncrementalSort(str, obj);
+ break;
case T_Unique:
_outUnique(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index f9a227e..ce1db85
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 2038,2049 ****
}
/*
! * _readSort
*/
! static Sort *
! _readSort(void)
{
! READ_LOCALS(Sort);
ReadCommonPlan(&local_node->plan);
--- 2038,2050 ----
}
/*
! * ReadCommonSort
! * Assign the basic stuff of all nodes that inherit from Sort
*/
! static void
! ReadCommonSort(Sort *local_node)
{
! READ_TEMP_LOCALS();
ReadCommonPlan(&local_node->plan);
*************** _readSort(void)
*** 2052,2057 ****
--- 2053,2084 ----
READ_OID_ARRAY(sortOperators, local_node->numCols);
READ_OID_ARRAY(collations, local_node->numCols);
READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+
+ /*
+ * _readSort
+ */
+ static Sort *
+ _readSort(void)
+ {
+ READ_LOCALS_NO_FIELDS(Sort);
+
+ ReadCommonSort(local_node);
+
+ READ_DONE();
+ }
+
+ /*
+ * _readIncrementalSort
+ */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ READ_LOCALS(IncrementalSort);
+
+ ReadCommonSort(&local_node->sort);
+
+ READ_INT_FIELD(skipCols);
READ_DONE();
}
*************** parseNodeString(void)
*** 2604,2609 ****
--- 2631,2638 ----
return_value = _readMaterial();
else if (MATCH("SORT", 4))
return_value = _readSort();
+ else if (MATCH("INCREMENTALSORT", 15))
+ return_value = _readIncrementalSort();
else if (MATCH("GROUP", 5))
return_value = _readGroup();
else if (MATCH("AGG", 3))
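Note that the second argument of MATCH must be the exact length of the label literal (hence 15 for "INCREMENTALSORT"). A simplified, hypothetical re-implementation of the length-then-bytes check, just to illustrate why a wrong length silently breaks dispatch:

```c
#include <assert.h>
#include <string.h>

/*
 * A token matches a node label only if both the length and the bytes
 * agree, so the length argument must be strlen() of the literal.
 */
static int
match_label(const char *token, int toklen, const char *label, int lablen)
{
	return toklen == lablen && strncmp(token, label, lablen) == 0;
}
```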
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index b93b4fc..74c047a
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3280,3285 ****
--- 3280,3289 ----
ptype = "Sort";
subpath = ((SortPath *) path)->subpath;
break;
+ case T_IncrementalSortPath:
+ ptype = "IncrementalSort";
+ subpath = ((SortPath *) path)->subpath;
+ break;
case T_GroupPath:
ptype = "Group";
subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 52643d0..165d049
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
+ bool enable_incrementalsort = true;
bool enable_hashagg = true;
bool enable_nestloop = true;
bool enable_material = true;
*************** cost_recursive_union(Path *runion, Path
*** 1600,1605 ****
--- 1601,1613 ----
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
+ * The sort can be either a full sort of the relation or an incremental
+ * sort when we already have data presorted by some prefix of the required
+ * pathkeys. In the latter case we estimate the number of groups the
+ * presorted pathkeys divide the source data into, and then estimate the
+ * cost of sorting each individual group, assuming the data is divided
+ * into groups uniformly. Also, if a LIMIT is specified, we only have to
+ * pull from the source and sort some of the groups.
+ *
* If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path
*** 1626,1632 ****
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
--- 1634,1641 ----
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_startup_cost' is the startup cost for reading the input data
! * 'input_total_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path
*** 1642,1660 ****
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_cost;
! Cost run_cost = 0;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
path->rows = tuples;
--- 1651,1678 ----
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_startup_cost;
! Cost run_cost = 0,
! rest_cost,
! group_cost,
! input_run_cost = input_total_cost - input_startup_cost;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
+ double num_groups,
+ group_input_bytes,
+ group_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
+ if (!enable_incrementalsort)
+ presorted_keys = 0;
path->rows = tuples;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1680,1692 ****
output_bytes = input_bytes;
}
! if (output_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(input_bytes / BLCKSZ);
! double nruns = input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
--- 1698,1747 ----
output_bytes = input_bytes;
}
! /*
! * Estimate the number of groups the dataset is divided into by the
! * presorted keys.
! */
! if (presorted_keys > 0)
! {
! List *presortedExprs = NIL;
! ListCell *l;
! int i = 0;
!
! /* Extract presorted keys as list of expressions */
! foreach(l, pathkeys)
! {
! PathKey *key = (PathKey *)lfirst(l);
! EquivalenceMember *member = (EquivalenceMember *)
! lfirst(list_head(key->pk_eclass->ec_members));
!
! presortedExprs = lappend(presortedExprs, member->em_expr);
!
! i++;
! if (i >= presorted_keys)
! break;
! }
!
! /* Estimate number of groups with equal presorted keys */
! num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! }
! else
! {
! num_groups = 1.0;
! }
!
! /*
! * Estimate the average cost of sorting one group of tuples with equal
! * presorted keys.
! */
! group_input_bytes = input_bytes / num_groups;
! group_tuples = tuples / num_groups;
! if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(group_input_bytes / BLCKSZ);
! double nruns = group_input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1696,1702 ****
*
* Assume about N log2 N comparisons
*/
! startup_cost += comparison_cost * tuples * LOG2(tuples);
/* Disk costs */
--- 1751,1757 ----
*
* Assume about N log2 N comparisons
*/
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
/* Disk costs */
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1707,1716 ****
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! startup_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1762,1771 ----
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! group_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1718,1731 ****
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
}
else
{
! /* We'll use plain quicksort on all the input tuples */
! startup_cost += comparison_cost * tuples * LOG2(tuples);
}
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
--- 1773,1805 ----
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
}
else
{
! /*
! * We'll use plain quicksort on each group of input tuples. If we
! * expect fewer than two tuples per sort group, assume the logarithmic
! * part of the estimate to be 1.
! */
! if (group_tuples >= 2.0)
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
! else
! group_cost = comparison_cost * group_tuples;
}
+ /* Add the per-group cost of fetching tuples from the input */
+ group_cost += input_run_cost / num_groups;
+
+ /*
+ * We have to sort the first group before the node can start returning
+ * tuples. Sorting the rest of the groups is required to return all
+ * the other tuples.
+ */
+ startup_cost += group_cost;
+ rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ if (rest_cost > 0.0)
+ run_cost += rest_cost;
+
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1736,1741 ****
--- 1810,1828 ----
*/
run_cost += cpu_operator_cost * tuples;
+ /* Extra costs of incremental sort */
+ if (presorted_keys > 0)
+ {
+ /*
+ * In the incremental sort case we must also account for the cost of
+ * detecting sort groups. This amounts to an extra copy and comparison
+ * for each tuple.
+ */
+ run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+ /* Cost of per group tuplesort reset */
+ run_cost += 2.0 * cpu_tuple_cost * num_groups;
+ }
+
path->startup_cost = startup_cost;
path->total_cost = startup_cost + run_cost;
}
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2489,2494 ****
--- 2576,2583 ----
cost_sort(&sort_path,
root,
outersortkeys,
+ pathkeys_common(outer_path->pathkeys, outersortkeys),
+ outer_path->startup_cost,
outer_path->total_cost,
outer_path_rows,
outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2515,2520 ****
--- 2604,2611 ----
cost_sort(&sort_path,
root,
innersortkeys,
+ pathkeys_common(inner_path->pathkeys, innersortkeys),
+ inner_path->startup_cost,
inner_path->total_cost,
inner_path_rows,
inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 2c26906..2da6f40
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
#include "nodes/nodeFuncs.h"
#include "nodes/plannodes.h"
#include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/tlist.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
return PATHKEYS_EQUAL;
}
+
+ /*
+ * pathkeys_common
+ * Returns length of longest common prefix of keys1 and keys2.
+ */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ int n;
+ ListCell *key1,
+ *key2;
+ n = 0;
+
+ forboth(key1, keys1, key2, keys2)
+ {
+ PathKey *pathkey1 = (PathKey *) lfirst(key1);
+ PathKey *pathkey2 = (PathKey *) lfirst(key2);
+
+ if (pathkey1 != pathkey2)
+ return n;
+ n++;
+ }
+
+ return n;
+ }
+
+
/*
* pathkeys_contained_in
* Common special case of compare_pathkeys: we just want to know
*************** get_cheapest_path_for_pathkeys(List *pat
*** 373,380 ****
/*
* get_cheapest_fractional_path_for_pathkeys
* Find the cheapest path (for retrieving a specified fraction of all
! * the tuples) that satisfies the given pathkeys and parameterization.
! * Return NULL if no such path.
*
* See compare_fractional_path_costs() for the interpretation of the fraction
* parameter.
--- 402,413 ----
/*
* get_cheapest_fractional_path_for_pathkeys
* Find the cheapest path (for retrieving a specified fraction of all
! * the tuples) that satisfies the given parameterization and at least
! * partially satisfies the given pathkeys. Return NULL if no such path.
! * If the pathkeys are only partially satisfied, we would have to perform
! * an incremental sort to satisfy them completely. Since incremental
! * sort consumes its input in presorted groups, we would have to consume
! * more input than with a fully presorted path.
*
* See compare_fractional_path_costs() for the interpretation of the fraction
* parameter.
*************** right_merge_direction(PlannerInfo *root,
*** 1488,1513 ****
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
! * no good to order by just the first key(s) of the requested ordering.
! * So the result is always either 0 or list_length(root->query_pathkeys).
*/
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
{
! if (root->query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
{
! /* It's useful ... or at least the first N keys are */
! return list_length(root->query_pathkeys);
}
-
- return 0; /* path ordering not useful */
}
/*
--- 1521,1562 ----
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Returns the number of pathkeys that match the given query_pathkeys.
! * The remaining keys can be satisfied by an incremental sort.
*/
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
{
! int n_common_pathkeys;
!
! if (query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
!
! if (enable_incrementalsort)
{
! /*
! * Return the number of pathkeys in common, or 0 if there are none.
! * Any leading common pathkeys are useful for ordering because we can
! * use incremental sort for the remaining keys.
! */
! return n_common_pathkeys;
! }
! else
! {
! /*
! * When incremental sort is disabled, pathkeys are useful only when they
! * do contain all the query pathkeys.
! */
! if (n_common_pathkeys == list_length(query_pathkeys))
! return n_common_pathkeys;
! else
! return 0;
}
}
/*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1523,1529 ****
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
--- 1572,1578 ----
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
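pathkeys_common() is just a longest-common-prefix count over two lists. The equivalent over plain arrays, for illustration (ints stand in for canonical PathKey pointers, which compare equal by identity):

```c
#include <assert.h>

/*
 * Illustrative analogue of pathkeys_common(): the length of the longest
 * common prefix of two key lists.
 */
static int
common_prefix_len(const int *keys1, int n1, const int *keys2, int n2)
{
	int			n = 0;

	while (n < n1 && n < n2 && keys1[n] == keys2[n])
		n++;
	return n;
}
```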
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 52daf43..3632215
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 237,243 ****
Plan *lefttree, Plan *righttree,
JoinType jointype, bool inner_unique,
bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 237,243 ----
Plan *lefttree, Plan *righttree,
JoinType jointype, bool inner_unique,
bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static Plan *prepare_sort_from_pathkeys(
*** 252,261 ****
static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 252,263 ----
static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! int skipCols);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 437,442 ****
--- 439,445 ----
(GatherPath *) best_path);
break;
case T_Sort:
+ case T_IncrementalSort:
plan = (Plan *) create_sort_plan(root,
(SortPath *) best_path,
flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1112,1117 ****
--- 1115,1121 ----
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int n_common_pathkeys;
/* Build the child plan */
/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1146,1154 ****
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
--- 1150,1160 ----
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
+ n_common_pathkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1497,1502 ****
--- 1503,1509 ----
Plan *subplan;
List *pathkeys = best_path->path.pathkeys;
List *tlist = build_path_tlist(root, &best_path->path);
+ int n_common_pathkeys;
/* As with Gather, it's best to project away columns in the workers. */
subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1523,1534 ****
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
--- 1530,1545 ----
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
! {
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ n_common_pathkeys,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
+ }
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1641,1646 ****
--- 1652,1658 ----
{
Sort *plan;
Plan *subplan;
+ int n_common_pathkeys;
/*
* We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1650,1656 ****
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
--- 1662,1672 ----
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! best_path->subpath->pathkeys);
!
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! n_common_pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1894,1900 ****
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
! subplan);
}
if (!rollup->is_hashed)
--- 1910,1917 ----
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
! subplan,
! 0);
}
if (!rollup->is_hashed)
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3830,3837 ****
*/
if (best_path->outersortkeys)
{
! Sort *sort = make_sort_from_pathkeys(outer_plan,
! best_path->outersortkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
--- 3847,3860 ----
*/
if (best_path->outersortkeys)
{
! Sort *sort;
! int n_common_pathkeys;
!
! n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! best_path->jpath.outerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3842,3849 ****
if (best_path->innersortkeys)
{
! Sort *sort = make_sort_from_pathkeys(inner_plan,
! best_path->innersortkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
--- 3865,3878 ----
if (best_path->innersortkeys)
{
! Sort *sort;
! int n_common_pathkeys;
!
! n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! best_path->jpath.innerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4901,4907 ****
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
--- 4930,4937 ----
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL, 0,
! lefttree->startup_cost,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5490,5502 ****
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node = makeNode(Sort);
! Plan *plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
--- 5520,5550 ----
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node;
! Plan *plan;
+ /* Always use regular sort node when enable_incrementalsort = false */
+ if (!enable_incrementalsort)
+ skipCols = 0;
+
+ if (skipCols == 0)
+ {
+ node = makeNode(Sort);
+ }
+ else
+ {
+ IncrementalSort *incrementalSort;
+
+ incrementalSort = makeNode(IncrementalSort);
+ node = &incrementalSort->sort;
+ incrementalSort->skipCols = skipCols;
+ }
+
+ plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass
*** 5829,5835 ****
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys)
{
int numsortkeys;
AttrNumber *sortColIdx;
--- 5877,5883 ----
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, int skipCols)
{
int numsortkeys;
AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree,
*** 5849,5855 ****
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5897,5903 ----
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5892,5898 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5940,5946 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, 0,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5913,5919 ****
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
--- 5961,5968 ----
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5946,5952 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5995,6001 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** is_projection_capable_plan(Plan *plan)
*** 6597,6602 ****
--- 6646,6652 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index 5565736..eaf7a78
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
#include "parser/parse_clause.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
#include "utils/syscache.h"
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index c4a5651..c1b8eb7
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3755,3768 ****
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_partial_path || is_sorted)
{
/* Sort the cheapest partial path, if it isn't already */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3755,3768 ----
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_partial_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest partial path, if it isn't already */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3835,3848 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_path || is_sorted)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3835,3848 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4909,4921 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || is_sorted)
{
! if (!is_sorted)
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
--- 4909,4921 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || n_useful_pathkeys > 0)
{
! if (n_useful_pathkeys < list_length(root->sort_pathkeys))
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid
*** 6044,6051 ****
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL,
! seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
--- 6044,6052 ----
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL, 0,
! seqScanPath->startup_cost, seqScanPath->total_cost,
! rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index c192dc4..92e9923
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 634,639 ****
--- 634,640 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index c1be34d..88143d2
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2701,2706 ****
--- 2701,2707 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_Gather:
case T_GatherMerge:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index a1be858..f3f885f
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 973,979 ****
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 973,980 ----
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, 0,
! sorted_p.startup_cost, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 2d5caae..eff7ac1
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
}
/*
! * compare_path_fractional_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
--- 95,101 ----
}
/*
! * compare_fractional_path_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1297,1308 ****
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1297,1309 ----
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ int n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1316,1321 ****
--- 1317,1324 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->parent->tuples,
subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1552,1558 ****
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
--- 1555,1562 ----
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL, 0,
! subpath->startup_cost,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
*************** create_gather_merge_path(PlannerInfo *ro
*** 1644,1649 ****
--- 1648,1654 ----
GatherMergePath *pathnode = makeNode(GatherMergePath);
Cost input_startup_cost = 0;
Cost input_total_cost = 0;
+ int n_common_pathkeys;
Assert(subpath->parallel_safe);
Assert(pathkeys);
*************** create_gather_merge_path(PlannerInfo *ro
*** 1660,1666 ****
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1665,1673 ----
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
!
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_gather_merge_path(PlannerInfo *ro
*** 1674,1679 ****
--- 1681,1688 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2516,2524 ****
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode = makeNode(SortPath);
- pathnode->path.pathtype = T_Sort;
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
--- 2525,2555 ----
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode;
! int n_common_pathkeys;
!
! if (enable_incrementalsort)
! n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! else
! n_common_pathkeys = 0;
!
! if (n_common_pathkeys == 0)
! {
! pathnode = makeNode(SortPath);
! pathnode->path.pathtype = T_Sort;
! }
! else
! {
! IncrementalSortPath *incpathnode;
!
! incpathnode = makeNode(IncrementalSortPath);
! pathnode = &incpathnode->spath;
! pathnode->path.pathtype = T_IncrementalSort;
! incpathnode->skipCols = n_common_pathkeys;
! }
!
! Assert(n_common_pathkeys < list_length(pathkeys));
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2532,2538 ****
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root, pathkeys,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
--- 2563,2571 ----
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root,
! pathkeys, n_common_pathkeys,
! subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2840,2846 ****
else
{
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL,
0.0,
subpath->rows,
subpath->pathtarget->width,
--- 2873,2880 ----
else
{
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL, 0,
! 0.0,
0.0,
subpath->rows,
subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index 8502fcf..0af631a
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 277,283 ****
qstate->sortOperators,
qstate->sortCollations,
qstate->sortNullsFirsts,
! work_mem, false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
--- 277,283 ----
qstate->sortOperators,
qstate->sortCollations,
qstate->sortNullsFirsts,
! work_mem, false, false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index a35b93b..885bf43
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3568,3573 ****
--- 3568,3609 ----
}
/*
+ * estimate_pathkeys_groups - Estimate the number of groups that the given
+ * pathkeys divide the dataset into.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * that the first i+1 pathkeys divide the dataset into.  This is a convenience
+ * wrapper over estimate_num_groups().
+ */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ ListCell *l;
+ List *groupExprs = NIL;
+ double *result;
+ int i;
+
+ /*
+ * Get number of groups for each prefix of pathkeys.
+ */
+ i = 0;
+ result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ groupExprs = lappend(groupExprs, member->em_expr);
+
+ result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ i++;
+ }
+
+ return result;
+ }
+
+ /*
* Estimate hash bucketsize fraction (ie, number of entries in a bucket
* divided by total tuples in relation) if the specified expression is used
* as a hash key.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 587fbce..d2b2596
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 861,866 ****
--- 861,875 ----
NULL, NULL, NULL
},
{
+ {"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of incremental sort steps."),
+ NULL
+ },
+ &enable_incrementalsort,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of hashed aggregation plans."),
NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 5f62cd5..9822e27
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 282,287 ****
--- 282,294 ----
int64 allowedMem; /* total memory allowed, in bytes */
int maxTapes; /* number of tapes (Knuth's T) */
int tapeRange; /* maxTapes-1 (Knuth's P) */
+ int64 maxSpace; /* maximum amount of space occupied among sorts
+ of groups, either in-memory or on-disk */
+ bool maxSpaceOnDisk; /* true when maxSpace is the value for on-disk
+ space, false when it's the value for
+ in-memory space */
+ TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ MemoryContext maincontext;
MemoryContext sortcontext; /* memory context holding most sort data */
MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 636,641 ****
--- 643,651 ----
static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+
/*
* Special versions of qsort just for SortTuple objects. qsort_tuple() sorts
*************** static Tuplesortstate *
*** 670,688 ****
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Create a working memory context for this sort operation. All data
! * needed by the sort will live inside this context.
*/
! sortcontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
--- 680,709 ----
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
+ MemoryContext maincontext;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Memory context surviving tuplesort_reset. This memory context holds
! * data that is worth keeping while sorting multiple similar batches.
*/
! maincontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
+ * Create a working memory context for one sort operation. The content of
+ * this context is deleted by tuplesort_reset.
+ */
+ sortcontext = AllocSetContextCreate(maincontext,
+ "TupleSort sort",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+
+ /*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 699,705 ****
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(sortcontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
--- 720,726 ----
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(maincontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
*************** tuplesort_begin_common(int workMem, bool
*** 717,722 ****
--- 738,744 ----
state->availMem = state->allowedMem;
state->sortcontext = sortcontext;
state->tuplecontext = tuplecontext;
+ state->maincontext = maincontext;
state->tapeset = NULL;
state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 757,769 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
AssertArg(nkeys > 0);
--- 779,792 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
AssertArg(nkeys > 0);
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 805,811 ****
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0);
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
--- 828,834 ----
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0) && !skipAbbrev;
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 836,842 ****
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 859,865 ----
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 927,933 ****
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 950,956 ----
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 1002,1008 ****
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 1025,1031 ----
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 1044,1050 ****
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 1067,1073 ----
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1155,1170 ****
}
/*
! * tuplesort_end
! *
! * Release resources and clean up.
*
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
*/
! void
! tuplesort_end(Tuplesortstate *state)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1178,1189 ----
}
/*
! * tuplesort_free
*
! * Internal routine for freeing resources of tuplesort.
*/
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1223,1229 ****
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! MemoryContextDelete(state->sortcontext);
}
/*
--- 1242,1339 ----
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! if (delete)
! {
! MemoryContextDelete(state->maincontext);
! }
! else
! {
! MemoryContextResetOnly(state->sortcontext);
! MemoryContextResetOnly(state->tuplecontext);
! }
! }
!
! /*
! * tuplesort_end
! *
! * Release resources and clean up.
! *
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
! */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! tuplesort_free(state, true);
! }
!
! /*
! * tuplesort_updatemax
! *
! * Update maximum resource usage statistics.
! */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! int64 spaceUsed;
! bool spaceUsedOnDisk;
!
! /*
! * Note: it might seem we should provide both memory and disk usage for a
! * disk-based sort. However, the current code doesn't track memory space
! * accurately once we have begun to return tuples to the caller (since we
! * don't account for pfree's the caller is expected to do), so we cannot
! * rely on availMem in a disk sort. This does not seem worth the overhead
! * to fix. Is it worth creating an API for the memory context code to
! * tell us how much is actually used in sortcontext?
! */
! if (state->tapeset)
! {
! spaceUsedOnDisk = true;
! spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! }
! else
! {
! spaceUsedOnDisk = false;
! spaceUsed = state->allowedMem - state->availMem;
! }
!
! if (spaceUsed > state->maxSpace)
! {
! state->maxSpace = spaceUsed;
! state->maxSpaceOnDisk = spaceUsedOnDisk;
! state->maxSpaceStatus = state->status;
! }
! }
!
! /*
! * tuplesort_reset
! *
! * Reset the tuplesort. Reset all the data in the tuplesort, but leave the
! * meta-information in. After tuplesort_reset, tuplesort is ready to start
! * a new sort. This allows us to avoid recreating the tuplesort (and to
! * save resources) when sorting multiple small batches.
! */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! tuplesort_updatemax(state);
! tuplesort_free(state, false);
! state->status = TSS_INITIAL;
! state->memtupcount = 0;
! state->boundUsed = false;
! state->tapeset = NULL;
! state->currentRun = 0;
! state->result_tape = -1;
! state->bounded = false;
! state->availMem = state->allowedMem;
! state->lastReturnedTuple = NULL;
! state->slabAllocatorUsed = false;
! state->slabMemoryBegin = NULL;
! state->slabMemoryEnd = NULL;
! state->slabFreeHead = NULL;
! USEMEM(state, GetMemoryChunkSpace(state->memtuples));
}
/*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 3235,3261 ****
const char **spaceType,
long *spaceUsed)
{
! /*
! * Note: it might seem we should provide both memory and disk usage for a
! * disk-based sort. However, the current code doesn't track memory space
! * accurately once we have begun to return tuples to the caller (since we
! * don't account for pfree's the caller is expected to do), so we cannot
! * rely on availMem in a disk sort. This does not seem worth the overhead
! * to fix. Is it worth creating an API for the memory context code to
! * tell us how much is actually used in sortcontext?
! */
! if (state->tapeset)
! {
*spaceType = "Disk";
- *spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- }
else
- {
*spaceType = "Memory";
! *spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! }
! switch (state->status)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
--- 3345,3359 ----
const char **spaceType,
long *spaceUsed)
{
! tuplesort_updatemax(state);
!
! if (state->maxSpaceOnDisk)
*spaceType = "Disk";
else
*spaceType = "Memory";
! *spaceUsed = (state->maxSpace + 1023) / 1024;
! switch (state->maxSpaceStatus)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...09c5a27
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,25 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+
+ #include "nodes/execnodes.h"
+
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node,
+ EState *estate, int eflags);
+ extern TupleTableSlot *ExecIncrementalSort(IncrementalSortState *node);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+ #endif /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index f289f3c..0b6ff3d
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1692,1697 ****
--- 1692,1711 ----
Tuplestorestate *tuplestorestate;
} MaterialState;
+
+ /* ----------------
+ * When sorting by multiple keys, the input dataset could already be
+ * presorted by some prefix of those keys. We call these "skip keys".
+ * SkipKeyData represents information about one such key.
+ * ----------------
+ */
+ typedef struct SkipKeyData
+ {
+ FmgrInfo flinfo; /* comparison function info */
+ FunctionCallInfoData fcinfo; /* comparison function call info */
+ OffsetNumber attno; /* attribute number in tuple */
+ } SkipKeyData;
+
/* ----------------
* SortState information
* ----------------
*************** typedef struct SortState
*** 1708,1713 ****
--- 1722,1747 ----
void *tuplesortstate; /* private state of tuplesort.c */
} SortState;
+ /* ----------------
+ * IncrementalSortState information
+ * ----------------
+ */
+ typedef struct IncrementalSortState
+ {
+ ScanState ss; /* its first field is NodeTag */
+ bool bounded; /* is the result set bounded? */
+ int64 bound; /* if bounded, how many tuples are needed */
+ bool sort_Done; /* sort completed yet? */
+ bool finished; /* fetching tuples from outer node
+ is finished? */
+ bool bounded_Done; /* value of bounded we did the sort with */
+ int64 bound_Done; /* value of bound we did the sort with */
+ void *tuplesortstate; /* private state of tuplesort.c */
+ SkipKeyData *skipKeys; /* keys the dataset is presorted by */
+ int64 groupsCount; /* number of groups with equal skip keys */
+ TupleTableSlot *sampleSlot; /* slot for sample tuple of sort group */
+ } IncrementalSortState;
+
/* ---------------------
* GroupState information
* ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index f59d719..3e76ce3
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 73,78 ****
--- 73,79 ----
T_HashJoin,
T_Material,
T_Sort,
+ T_IncrementalSort,
T_Group,
T_Agg,
T_WindowAgg,
*************** typedef enum NodeTag
*** 125,130 ****
--- 126,132 ----
T_HashJoinState,
T_MaterialState,
T_SortState,
+ T_IncrementalSortState,
T_GroupState,
T_AggState,
T_WindowAggState,
*************** typedef enum NodeTag
*** 240,245 ****
--- 242,248 ----
T_ProjectionPath,
T_ProjectSetPath,
T_SortPath,
+ T_IncrementalSortPath,
T_GroupPath,
T_UpperUniquePath,
T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index 164105a..f845026
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 751,756 ****
--- 751,767 ----
bool *nullsFirst; /* NULLS FIRST/LAST directions */
} Sort;
+
+ /* ----------------
+ * incremental sort node
+ * ----------------
+ */
+ typedef struct IncrementalSort
+ {
+ Sort sort;
+ int skipCols; /* number of presorted columns */
+ } IncrementalSort;
+
/* ---------------
* group node -
* Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index adbd3dd..96eebd3
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1419,1424 ****
--- 1419,1434 ----
} SortPath;
/*
+ * IncrementalSortPath
+ */
+ typedef struct IncrementalSortPath
+ {
+ SortPath spath;
+ int skipCols;
+ } IncrementalSortPath;
+
+
+ /*
* GroupPath represents grouping (of presorted input)
*
* groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index ed70def..47c26c4
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
extern bool enable_bitmapscan;
extern bool enable_tidscan;
extern bool enable_sort;
+ extern bool enable_incrementalsort;
extern bool enable_hashagg;
extern bool enable_nestloop;
extern bool enable_material;
*************** extern void cost_namedtuplestorescan(Pat
*** 102,109 ****
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
--- 103,111 ----
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 25fe78c..01073dd
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 182,187 ****
--- 182,188 ----
extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
Relids required_outer,
CostSelector cost_criterion,
*************** extern List *select_outer_pathkeys_for_m
*** 220,225 ****
--- 221,227 ----
extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
List *mergeclauses,
List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
extern List *truncate_useless_pathkeys(PlannerInfo *root,
RelOptInfo *rel,
List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 9f9d2dc..b8884b6
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 204,209 ****
--- 204,212 ----
extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows, List **pgset);
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ double tuples);
+
extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
double nbuckets);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 14b9026..4ea68e7
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 62,68 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
--- 62,69 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 106,111 ****
--- 107,114 ----
extern void tuplesort_end(Tuplesortstate *state);
+ extern void tuplesort_reset(Tuplesortstate *state);
+
extern void tuplesort_get_stats(Tuplesortstate *state,
const char **sortMethod,
const char **spaceType,
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort
*** 19,27 ****
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Sort
Sort Key: id, data
! -> Seq Scan on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
--- 19,28 ----
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Incremental Sort
Sort Key: id, data
! Presorted Key: id
! -> Index Scan using test_dc_pkey on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index 6163ed8..9553648
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE: drop cascades to table matest1
*** 1493,1498 ****
--- 1493,1499 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1633,1641 ****
--- 1634,1678 ----
{3,7,8,10,13,13,16,18,19,22}
(3 rows)
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ QUERY PLAN
+ -------------------------------------------------------------------------
+ Merge Append
+ Sort Key: tenk1.thousand, tenk1.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1
+ -> Incremental Sort
+ Sort Key: tenk1_1.thousand, tenk1_1.thousand
+ Presorted Key: tenk1_1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ QUERY PLAN
+ -------------------------------------------------------------
+ Merge Append
+ Sort Key: a.thousand, a.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 a
+ -> Incremental Sort
+ Sort Key: b.unique2, b.unique2
+ Presorted Key: b.unique2
+ -> Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index 568b783..e60fb43
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select count(*) >= 0 as ok from pg_prepa
*** 70,90 ****
-- This is to record the prevailing planner enable_foo settings during
-- a regression test run.
select name, setting from pg_settings where name like 'enable%';
! name | setting
! ----------------------+---------
! enable_bitmapscan | on
! enable_gathermerge | on
! enable_hashagg | on
! enable_hashjoin | on
! enable_indexonlyscan | on
! enable_indexscan | on
! enable_material | on
! enable_mergejoin | on
! enable_nestloop | on
! enable_seqscan | on
! enable_sort | on
! enable_tidscan | on
! (12 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
--- 70,91 ----
-- This is to record the prevailing planner enable_foo settings during
-- a regression test run.
select name, setting from pg_settings where name like 'enable%';
! name | setting
! ------------------------+---------
! enable_bitmapscan | on
! enable_gathermerge | on
! enable_hashagg | on
! enable_hashjoin | on
! enable_incrementalsort | on
! enable_indexonlyscan | on
! enable_indexscan | on
! enable_material | on
! enable_mergejoin | on
! enable_nestloop | on
! enable_seqscan | on
! enable_sort | on
! enable_tidscan | on
! (13 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index d43b75c..ec611f5
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 527,532 ****
--- 527,533 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
*************** SELECT
*** 588,596 ****
--- 589,614 ----
ORDER BY f.i LIMIT 10)
FROM generate_series(1, 3) g(i);
+ set enable_incrementalsort = on;
+
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
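The core idea of the attached executor node — the input is already sorted on a prefix of the sort keys ("skip keys"), so each prefix-equal group can be sorted independently on the remaining keys and emitted before the next group is read — can be sketched in Python. This is purely an illustration, not the patch's C implementation:

```python
from itertools import groupby
from operator import itemgetter

def incremental_sort(rows, skip_cols, sort_cols):
    """Sketch of the Incremental Sort idea: rows are presorted on
    skip_cols, so sort each prefix-equal group on sort_cols and emit
    it immediately. Hypothetical helper, not from the patch."""
    prefix = itemgetter(*skip_cols)
    rest = itemgetter(*sort_cols)
    for _, group in groupby(rows, key=prefix):
        # Each group is a small independent batch; this per-group reuse
        # is what tuplesort_reset() enables on the C side.
        yield from sorted(group, key=rest)

rows = [(1, 'b'), (1, 'a'), (2, 'c'), (2, 'a'), (3, 'b')]  # presorted on col 0
out = list(incremental_sort(rows, skip_cols=[0], sort_cols=[1]))
assert out == [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'c'), (3, 'b')]
```

Because groups are consumed one at a time, the full result never has to be materialized, which is also why the node can stream tuples under LIMIT.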
On Fri, May 5, 2017 at 11:13 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
Incremental sort is faster in the vast majority of cases. It appears to be
slower only when the whole dataset is one sort group. In this case
incremental sort is useless, and using it should be considered a misuse of
incremental sort. The slowdown is related to the fact that we have to do
extra comparisons anyway, unless we somehow push our comparison result into
qsort itself and save some cpu cycles (but that would be an unreasonable
break of encapsulation). Thus, in such cases a regression seems inevitable
anyway. I think we could avoid this regression during query planning: if we
see that there would be only a few groups, we should choose a plain sort
instead of an incremental sort.
I'm sorry that I don't have time to review this in detail right now,
but it sounds like you are doing good work to file down cases where
this might cause regressions, which is great. Regarding the point in
the paragraph above, I'd say that it's OK for the planner to be
responsible for picking between Sort and Incremental Sort in some way.
It is, after all, the planner's job to decide between different
strategies for executing the same query and, of course, sometimes it
will be wrong, but that's OK as long as it's not wrong too often (or
by too much, hopefully). It may be a little difficult to get this
right, though, because I'm not sure that the information you need
actually exists (or is reliable). For example, consider the case
where we need to sort 100m rows and there are 2 groups. If 1 group
contains 1 row and the other group contains all of the rest, there is
really no point in an incremental sort. On the other hand, if each
group contains 50m rows and we can get the data presorted by the
grouping column, there might be a lot of point to an incremental sort,
because two 50m-row sorts might be a lot cheaper than one 100m sort.
More generally, it's quite easy to imagine situations where the
individual groups can be quicksorted but sorting all of the rows
requires I/O, even when the number of groups isn't that big. On the
other hand, the real sweet spot for this is probably the case where
the number of groups is very large, with many single-row groups or
many groups with just a few rows each, so if we can at least get this
to work in those cases that may be good enough. On the third hand,
when costing aggregation, I think we often underestimate the number of
groups and there might well be similar problems here.
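The tradeoff described above — group sorts that stay in memory beat one big spilling sort, while a degenerate 1-row group buys nothing — can be sketched with a toy cost model. The n·log2(n) term and the spill penalty below are assumptions for illustration only; they are not the planner's actual cost function:

```python
import math

WORK_MEM_TUPLES = 1_000_000  # hypothetical in-memory sort capacity

def sort_cost(n):
    """Toy model: n*log2(n) comparisons plus a crude per-tuple I/O
    penalty once the sort spills to disk. Not PostgreSQL's cost model."""
    if n <= 1:
        return 0
    cost = n * math.log2(n)
    if n > WORK_MEM_TUPLES:
        cost += 10 * n  # assumed spill penalty
    return cost

def incremental_sort_cost(group_sizes):
    # One sort per group, plus one comparison per tuple on the
    # presorted ("skip") keys to detect group boundaries.
    return sum(sort_cost(g) for g in group_sizes) + sum(group_sizes)

N = 100_000_000
plain = sort_cost(N)
many_small = incremental_sort_cost([100] * (N // 100))  # all fit in memory
one_tuple_group = incremental_sort_cost([1, N - 1])     # degenerate skew

assert many_small < plain          # sweet spot: many tiny in-memory sorts
assert one_tuple_group >= plain    # no win, plus extra boundary compares
```

Note that in a pure comparison-count model the two strategies tie for equal-sized groups; the win comes from the spill term, matching the observation that the groups can be quicksorted while the full sort requires I/O.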
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, May 8, 2017 at 6:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, May 5, 2017 at 11:13 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
Incremental sort is faster in the vast majority of cases. It appears to be
slower only when the whole dataset is one sort group. In this case
incremental sort is useless, and using it should be considered a misuse of
incremental sort. The slowdown is related to the fact that we have to do
extra comparisons anyway, unless we somehow push our comparison result into
qsort itself and save some cpu cycles (but that would be an unreasonable
break of encapsulation). Thus, in such cases a regression seems inevitable
anyway. I think we could avoid this regression during query planning: if we
see that there would be only a few groups, we should choose a plain sort
instead of an incremental sort.
I'm sorry that I don't have time to review this in detail right now,
but it sounds like you are doing good work to file down cases where
this might cause regressions, which is great.
Thank you for paying attention to this patch!
Regarding the point in
the paragraph above, I'd say that it's OK for the planner to be
responsible for picking between Sort and Incremental Sort in some way.
It is, after all, the planner's job to decide between different
strategies for executing the same query and, of course, sometimes it
will be wrong, but that's OK as long as it's not wrong too often (or
by too much, hopefully).
Right, I agree.
It may be a little difficult to get this
right, though, because I'm not sure that the information you need
actually exists (or is reliable). For example, consider the case
where we need to sort 100m rows and there are 2 groups. If 1 group
contains 1 row and the other group contains all of the rest, there is
really no point in an incremental sort. On the other hand, if each
group contains 50m rows and we can get the data presorted by the
grouping column, there might be a lot of point to an incremental sort,
because two 50m-row sorts might be a lot cheaper than one 100m sort.
More generally, it's quite easy to imagine situations where the
individual groups can be quicksorted but sorting all of the rows
requires I/O, even when the number of groups isn't that big. On the
other hand, the real sweet spot for this is probably the case where
the number of groups is very large, with many single-row groups or
many groups with just a few rows each, so if we can at least get this
to work in those cases that may be good enough. On the third hand,
when costing aggregation, I think we often underestimate the number of
groups and there might well be similar problems here.
I agree with that. I need to test this patch more carefully in cases where
groups have different sizes. It's likely I'll need to add yet another
parameter to my testing script: group size skew.
A patch rebased onto current master is attached. I'm going to improve my
testing script and post new results.
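A group-size skew parameter of that sort could be generated with a Zipf-like weighting, for example. This is a hypothetical helper for illustration; the actual testing script is not shown in the thread:

```python
def skewed_group_sizes(total_rows, num_groups, skew):
    """Split total_rows into num_groups sizes weighted by (i+1)**-skew.
    skew=0 gives equal groups; larger skew makes early groups dominate.
    Hypothetical test-data helper, not part of the patch."""
    weights = [(i + 1) ** -skew for i in range(num_groups)]
    scale = total_rows / sum(weights)
    sizes = [max(1, round(w * scale)) for w in weights]
    # Put any rounding remainder into the largest group so the
    # sizes sum exactly to total_rows.
    sizes[0] += total_rows - sum(sizes)
    return sizes

uniform = skewed_group_sizes(1000, 10, 0.0)
skewed = skewed_group_sizes(1000, 10, 1.5)
assert sum(uniform) == 1000 and sum(skewed) == 1000
assert max(uniform) == min(uniform)   # skew=0: all groups equal
assert skewed[0] > skewed[-1]         # skewed: first group largest
```

Sweeping `skew` alongside group count would cover both the balanced case and the 1-row-group degenerate case discussed above.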
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
incremental-sort-8.patchapplication/octet-stream; name=incremental-sort-8.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index c19b331..38c7e11
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1981,2019 ****
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ---------------------------------------------------------------------
Limit
! Output: t1.c1, t2.c1
-> Sort
! Output: t1.c1, t2.c1
! Sort Key: t1.c1, t2.c1
-> Nested Loop
! Output: t1.c1, t2.c1
-> Foreign Scan on public.ft1 t1
! Output: t1.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-> Materialize
! Output: t2.c1
-> Foreign Scan on public.ft2 t2
! Output: t2.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! c1 | c1
! ----+-----
! 1 | 101
! 1 | 102
! 1 | 103
! 1 | 104
! 1 | 105
! 1 | 106
! 1 | 107
! 1 | 108
! 1 | 109
! 1 | 110
(10 rows)
-- different server, not pushed down. No result expected.
--- 1981,2019 ----
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ------------------------------------------------------------------
Limit
! Output: t1.c3, t2.c3
-> Sort
! Output: t1.c3, t2.c3
! Sort Key: t1.c3, t2.c3
-> Nested Loop
! Output: t1.c3, t2.c3
-> Foreign Scan on public.ft1 t1
! Output: t1.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
-> Materialize
! Output: t2.c3
-> Foreign Scan on public.ft2 t2
! Output: t2.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! c3 | c3
! -------+-------
! 00001 | 00101
! 00001 | 00102
! 00001 | 00103
! 00001 | 00104
! 00001 | 00105
! 00001 | 00106
! 00001 | 00107
! 00001 | 00108
! 00001 | 00109
! 00001 | 00110
(10 rows)
-- different server, not pushed down. No result expected.
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index 5f65d9d..5dc7a24
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 510,517 ****
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 510,517 ----
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index 5f59a38..ac9c9f0
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3591,3596 ****
--- 3591,3610 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+ <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of incremental sort
+ steps. The default is <literal>on</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
<term><varname>enable_indexscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 4cee357..56aaa6f
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual,
*** 80,85 ****
--- 80,87 ----
ExplainState *es);
static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 93,99 ****
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 95,101 ----
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 101,106 ****
--- 103,110 ----
static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 1016,1021 ****
--- 1020,1028 ----
case T_Sort:
pname = sname = "Sort";
break;
+ case T_IncrementalSort:
+ pname = sname = "Incremental Sort";
+ break;
case T_Group:
pname = sname = "Group";
break;
*************** ExplainNode(PlanState *planstate, List *
*** 1606,1611 ****
--- 1613,1624 ----
show_sort_keys(castNode(SortState, planstate), ancestors, es);
show_sort_info(castNode(SortState, planstate), es);
break;
+ case T_IncrementalSort:
+ show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ ancestors, es);
+ show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ es);
+ break;
case T_MergeAppend:
show_merge_append_keys(castNode(MergeAppendState, planstate),
ancestors, es);
*************** static void
*** 1931,1945 ****
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
* Likewise, for a MergeAppend node.
*/
static void
--- 1944,1981 ----
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
+ int skipCols;
+
+ if (IsA(plan, IncrementalSort))
+ skipCols = ((IncrementalSort *) plan)->skipCols;
+ else
+ skipCols = 0;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, skipCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es)
+ {
+ IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+ show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ plan->sort.numCols, plan->skipCols,
+ plan->sort.sortColIdx,
+ plan->sort.sortOperators, plan->sort.collations,
+ plan->sort.nullsFirst,
+ ancestors, es);
+ }
+
+ /*
* Likewise, for a MergeAppend node.
*/
static void
*************** show_merge_append_keys(MergeAppendState
*** 1949,1955 ****
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
--- 1985,1991 ----
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, 0, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1973,1979 ****
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
--- 2009,2015 ----
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
*************** show_grouping_set_keys(PlanState *planst
*** 2042,2048 ****
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
--- 2078,2084 ----
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, 0, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 2099,2105 ****
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
--- 2135,2141 ----
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2112,2124 ****
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
--- 2148,2161 ----
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
+ List *resultPresorted = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2158,2166 ****
--- 2195,2207 ----
nullsFirst[keyno]);
/* Emit one property-list item per sort key */
result = lappend(result, pstrdup(sortkeybuf.data));
+ if (keyno < nPresortedKeys)
+ resultPresorted = lappend(resultPresorted, exprstr);
}
ExplainPropertyList(qlabel, result, es);
+ if (nPresortedKeys > 0)
+ ExplainPropertyList("Presorted Key", resultPresorted, es);
}
/*
*************** show_sort_info(SortState *sortstate, Exp
*** 2369,2374 ****
--- 2410,2504 ----
}
/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es)
+ {
+ if (es->analyze && incrsortstate->sort_Done &&
+ incrsortstate->tuplesortstate != NULL)
+ {
+ Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ TuplesortInstrumentation stats;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+
+ tuplesort_get_stats(state, &stats);
+ sortMethod = tuplesort_method_name(stats.sortMethod);
+ spaceType = tuplesort_space_type_name(stats.spaceType);
+ spaceUsed = stats.spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",
+ sortMethod, spaceType, spaceUsed);
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Groups: %ld\n",
+ incrsortstate->groupsCount);
+ }
+ else
+ {
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups",
+ incrsortstate->groupsCount, es);
+ }
+ }
+
+ if (incrsortstate->shared_info != NULL)
+ {
+ int n;
+ bool opened_group = false;
+
+ for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+ {
+ TuplesortInstrumentation *sinstrument;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+ int64 groupsCount;
+
+ sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+ groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+ if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+ continue; /* ignore any unfilled slots */
+ sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+ spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+ spaceUsed = sinstrument->spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str,
+ "Worker %d: Sort Method: %s %s: %ldkB Groups: %ld\n",
+ n, sortMethod, spaceType, spaceUsed, groupsCount);
+ }
+ else
+ {
+ if (!opened_group)
+ {
+ ExplainOpenGroup("Workers", "Workers", false, es);
+ opened_group = true;
+ }
+ ExplainOpenGroup("Worker", NULL, true, es);
+ ExplainPropertyInteger("Worker Number", n, es);
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups", groupsCount, es);
+ ExplainCloseGroup("Worker", NULL, true, es);
+ }
+ }
+ if (opened_group)
+ ExplainCloseGroup("Workers", "Workers", false, es);
+ }
+ }
+
+ /*
* Show information on hash buckets/batches.
*/
static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index 083b20f..b093618
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execExpr.
*** 24,31 ****
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
! nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
--- 24,31 ----
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
! nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index f1636a5..dd8cffe
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 31,36 ****
--- 31,37 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 253,258 ****
--- 254,263 ----
ExecReScanSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecReScanIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecReScanGroup((GroupState *) node);
break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 525,532 ****
--- 530,541 ----
case T_CteScan:
case T_Material:
case T_Sort:
+ /* these don't evaluate tlist */
return true;
+ case T_IncrementalSort:
+ return false;
+
case T_LockRows:
case T_Limit:
return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
new file mode 100644
index 8737cc1..2c8fa93
*** a/src/backend/executor/execParallel.c
--- b/src/backend/executor/execParallel.c
***************
*** 28,33 ****
--- 28,34 ----
#include "executor/nodeBitmapHeapscan.h"
#include "executor/nodeCustom.h"
#include "executor/nodeForeignscan.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeSeqscan.h"
*************** ExecParallelEstimate(PlanState *planstat
*** 258,263 ****
--- 259,268 ----
/* even when not parallel-aware */
ExecSortEstimate((SortState *) planstate, e->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+ break;
default:
break;
*************** ExecParallelInitializeDSM(PlanState *pla
*** 330,335 ****
--- 335,344 ----
/* even when not parallel-aware */
ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+ break;
default:
break;
*************** ExecParallelReInitializeDSM(PlanState *p
*** 706,711 ****
--- 715,724 ----
/* even when not parallel-aware */
ExecSortReInitializeDSM((SortState *) planstate, pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortReInitializeDSM((IncrementalSortState *) planstate, pcxt);
+ break;
default:
break;
*************** ExecParallelRetrieveInstrumentation(Plan
*** 764,769 ****
--- 777,784 ----
*/
if (IsA(planstate, SortState))
ExecSortRetrieveInstrumentation((SortState *) planstate);
+ else if (IsA(planstate, IncrementalSortState))
+ ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
return planstate_tree_walker(planstate, ExecParallelRetrieveInstrumentation,
instrumentation);
*************** ExecParallelInitializeWorker(PlanState *
*** 985,990 ****
--- 1000,1009 ----
/* even when not parallel-aware */
ExecSortInitializeWorker((SortState *) planstate, toc);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate, toc);
+ break;
default:
break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index c1aa506..e4225df
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 88,93 ****
--- 88,94 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 314,319 ****
--- 315,325 ----
estate, eflags);
break;
+ case T_IncrementalSort:
+ result = (PlanState *) ExecInitIncrementalSort(
+ (IncrementalSort *) node, estate, eflags);
+ break;
+
case T_Group:
result = (PlanState *) ExecInitGroup((Group *) node,
estate, eflags);
*************** ExecEndNode(PlanState *node)
*** 679,684 ****
--- 685,694 ----
ExecEndSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecEndIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecEndGroup((GroupState *) node);
break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index 0ae5873..dab5a1e
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 655,660 ****
--- 655,661 ----
sortnode->collations,
sortnode->nullsFirst,
work_mem,
+ false,
false);
}
*************** initialize_aggregate(AggState *aggstate,
*** 742,748 ****
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false);
}
/*
--- 743,749 ----
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false, false);
}
/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...04059cc
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,644 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ * Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ * Incremental sort is a specially optimized kind of multikey sort used
+ * when the input is already presorted by a prefix of the required keys
+ * list. Thus, when it's required to sort by (key1, key2 ... keyN) and
+ * the input is already sorted by (key1, key2 ... keyM), M < N, we only
+ * need to sort the groups where the values of (key1, key2 ... keyM)
+ * are equal.
+ *
+ * Consider the following example. We have input tuples consisting of
+ * two integers (x, y), already presorted by x, while it's required to
+ * sort them by x and y. Let the input tuples be the following.
+ *
+ * (1, 5)
+ * (1, 2)
+ * (2, 10)
+ * (2, 1)
+ * (2, 5)
+ * (3, 3)
+ * (3, 7)
+ *
+ * The incremental sort algorithm would sort the following groups,
+ * which have equal x, individually:
+ * (1, 5) (1, 2)
+ * (2, 10) (2, 1) (2, 5)
+ * (3, 3) (3, 7)
+ *
+ * After sorting these groups and concatenating them, we get the
+ * following tuple set, which is sorted by both x and y.
+ *
+ * (1, 2)
+ * (1, 5)
+ * (2, 1)
+ * (2, 5)
+ * (2, 10)
+ * (3, 3)
+ * (3, 7)
+ *
+ * Incremental sort is faster than a full sort on large datasets. But
+ * it provides the greatest benefit for queries with LIMIT, because it
+ * can return the first tuples without reading the whole input dataset.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+
+ /*
+ * Check if the first "skipCols" sort values of two tuples are equal.
+ */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ TupleTableSlot *b)
+ {
+ int n, i;
+
+ Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+ n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+ for (i = 0; i < n; i++)
+ {
+ Datum datumA, datumB, result;
+ bool isnullA, isnullB;
+ AttrNumber attno = node->skipKeys[i].attno;
+ SkipKeyData *key;
+
+ datumA = slot_getattr(a, attno, &isnullA);
+ datumB = slot_getattr(b, attno, &isnullB);
+
+ /* Special case for NULL-vs-NULL, else use standard comparison */
+ if (isnullA || isnullB)
+ {
+ if (isnullA == isnullB)
+ continue;
+ else
+ return false;
+ }
+
+ key = &node->skipKeys[i];
+
+ key->fcinfo.arg[0] = datumA;
+ key->fcinfo.arg[1] = datumB;
+
+ /* just for paranoia's sake, we reset isnull each time */
+ key->fcinfo.isnull = false;
+
+ result = FunctionCallInvoke(&key->fcinfo);
+
+ /* Check for null result, since caller is clearly not expecting one */
+ if (key->fcinfo.isnull)
+ elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+ if (!DatumGetBool(result))
+ return false;
+ }
+ return true;
+ }
+
+ /*
+ * Prepare information for skipKeys comparison.
+ */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ int skipCols,
+ i;
+
+ Assert(IsA(plannode, IncrementalSort));
+ skipCols = plannode->skipCols;
+
+ node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+ for (i = 0; i < skipCols; i++)
+ {
+ Oid equalityOp, equalityFunc;
+ SkipKeyData *key;
+
+ key = &node->skipKeys[i];
+ key->attno = plannode->sort.sortColIdx[i];
+
+ equalityOp = get_equality_op_for_ordering_op(
+ plannode->sort.sortOperators[i], NULL);
+ if (!OidIsValid(equalityOp))
+ elog(ERROR, "missing equality operator for ordering operator %u",
+ plannode->sort.sortOperators[i]);
+
+ equalityFunc = get_opcode(equalityOp);
+ if (!OidIsValid(equalityFunc))
+ elog(ERROR, "missing function for operator %u", equalityOp);
+
+ /* Lookup the comparison function */
+ fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+ /* We can initialize the callinfo just once and re-use it */
+ InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ plannode->sort.collations[i], NULL, NULL);
+ key->fcinfo.argnull[0] = false;
+ key->fcinfo.argnull[1] = false;
+ }
+ }
+
+
+ #define MIN_GROUP_SIZE 32
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSort
+ *
+ * Assuming that the outer subtree returns tuples presorted by some
+ * prefix of the target sort columns, performs an incremental sort.
+ * It fetches groups of tuples where the prefix sort columns are equal
+ * and sorts them using tuplesort. This approach avoids sorting the
+ * whole dataset at once. Besides using less memory and being faster,
+ * it allows us to start returning tuples before fetching the full
+ * dataset from the outer subtree.
+ *
+ * Conditions:
+ * -- none.
+ *
+ * Initial States:
+ * -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+ static TupleTableSlot *
+ ExecIncrementalSort(PlanState *pstate)
+ {
+ IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+ EState *estate;
+ ScanDirection dir;
+ Tuplesortstate *tuplesortstate;
+ TupleTableSlot *slot;
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ PlanState *outerNode;
+ TupleDesc tupDesc;
+ int64 nTuples = 0;
+
+ /*
+ * get state info from node
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "entering routine");
+
+ estate = node->ss.ps.state;
+ dir = estate->es_direction;
+ tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+ /*
+ * Return next tuple from sorted set if any.
+ */
+ if (node->sort_Done)
+ {
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ if (tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL) || node->finished)
+ return slot;
+ }
+
+ /*
+ * Read the next group of tuples from the outer plan and pass them to
+ * tuplesort.c. Subsequent calls fetch tuples from tuplesort until the
+ * sorted group is exhausted.
+ */
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "sorting subplan");
+
+ /*
+ * Want to scan subplan in the forward direction while creating the
+ * sorted data.
+ */
+ estate->es_direction = ForwardScanDirection;
+
+ /*
+ * Initialize tuplesort module.
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "calling tuplesort_begin");
+
+ outerNode = outerPlanState(node);
+ tupDesc = ExecGetResultType(outerNode);
+
+ if (node->tuplesortstate == NULL)
+ {
+ /*
+ * We are going to process the first group of presorted data.
+ * Initialize support structures for cmpSortSkipCols - already
+ * sorted columns.
+ */
+ prepareSkipCols(node);
+
+ /*
+ * Pass all the columns to tuplesort. We pass groups of at least
+ * MIN_GROUP_SIZE tuples to tuplesort, so these groups don't
+ * necessarily have equal values of the first column. With
+ * incremental sort we're unlikely to have huge groups, so using
+ * abbreviated keys would likely be a waste of time.
+ */
+ tuplesortstate = tuplesort_begin_heap(
+ tupDesc,
+ plannode->sort.numCols,
+ plannode->sort.sortColIdx,
+ plannode->sort.sortOperators,
+ plannode->sort.collations,
+ plannode->sort.nullsFirst,
+ work_mem,
+ false,
+ true);
+ node->tuplesortstate = (void *) tuplesortstate;
+ node->groupsCount++;
+ }
+ else
+ {
+ /* Next group of presorted data */
+ tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ node->groupsCount++;
+ }
+
+ /* Calculate remaining bound for bounded sort */
+ if (node->bounded)
+ tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+ /* Put the saved tuple, if any, into the tuplesort */
+ if (!TupIsNull(node->sampleSlot))
+ {
+ tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+ ExecClearTuple(node->sampleSlot);
+ nTuples++;
+ }
+
+ /*
+ * Put the next group of tuples, where the skipCols sort values are
+ * equal, into the tuplesort.
+ */
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+
+ /* Put next group of presorted data to the tuplesort */
+ if (nTuples < MIN_GROUP_SIZE)
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+
+ /* Save last tuple in minimal group */
+ if (nTuples == MIN_GROUP_SIZE - 1)
+ ExecCopySlot(node->sampleSlot, slot);
+ nTuples++;
+ }
+ else
+ {
+ /* Iterate while the skip cols are the same as in the saved tuple */
+ bool cmp;
+ cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+
+ if (cmp)
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ nTuples++;
+ }
+ else
+ {
+ ExecCopySlot(node->sampleSlot, slot);
+ break;
+ }
+ }
+ }
+
+ /*
+ * Complete the sort.
+ */
+ tuplesort_performsort(tuplesortstate);
+
+ /*
+ * restore to user specified direction
+ */
+ estate->es_direction = dir;
+
+ /*
+ * finally set the sorted flag to true
+ */
+ node->sort_Done = true;
+ node->bounded_Done = node->bounded;
+ if (node->shared_info && node->am_worker)
+ {
+ TuplesortInstrumentation *si;
+
+ Assert(IsParallelWorker());
+ Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+ si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+ tuplesort_get_stats(tuplesortstate, si);
+ node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+ node->groupsCount;
+ }
+
+ /*
+ * Adjust bound_Done with number of tuples we've actually sorted.
+ */
+ if (node->bounded)
+ {
+ if (node->finished)
+ node->bound_Done = node->bound;
+ else
+ node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ }
+
+ SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "retrieving tuple from tuplesort");
+
+ /*
+ * Get the first or next tuple from tuplesort. Returns NULL if no more
+ * tuples.
+ */
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+ return slot;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecInitIncrementalSort
+ *
+ * Creates the run-time state information for the sort node
+ * produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ IncrementalSortState *incrsortstate;
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "initializing sort node");
+
+ /*
+ * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only hold the
+ * current group of tuples in tuplesortstate.
+ */
+ Assert((eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) == 0);
+
+ /*
+ * create state structure
+ */
+ incrsortstate = makeNode(IncrementalSortState);
+ incrsortstate->ss.ps.plan = (Plan *) node;
+ incrsortstate->ss.ps.state = estate;
+ incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+ incrsortstate->bounded = false;
+ incrsortstate->sort_Done = false;
+ incrsortstate->finished = false;
+ incrsortstate->tuplesortstate = NULL;
+ incrsortstate->sampleSlot = NULL;
+ incrsortstate->bound_Done = 0;
+ incrsortstate->groupsCount = 0;
+ incrsortstate->skipKeys = NULL;
+
+ /*
+ * Miscellaneous initialization
+ *
+ * Sort nodes don't initialize their ExprContexts because they never call
+ * ExecQual or ExecProject.
+ */
+
+ /*
+ * tuple table initialization
+ *
+ * sort nodes only return scan tuples from their sorted relation.
+ */
+ ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+
+ /*
+ * initialize child nodes
+ *
+ * We shield the child node from the need to support REWIND, BACKWARD, or
+ * MARK/RESTORE.
+ */
+ eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+ outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+ /*
+ * initialize tuple type. no need to initialize projection info because
+ * this node doesn't do projections.
+ */
+ ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+ /* make standalone slot to store previous tuple from outer node */
+ incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+ ExecGetResultType(outerPlanState(incrsortstate)));
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "sort node initialized");
+
+ return incrsortstate;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "shutting down sort node");
+
+ /*
+ * clean out the tuple table
+ */
+ ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ /* must drop standalone tuple slot from outer node */
+ ExecDropSingleTupleTableSlot(node->sampleSlot);
+
+ /*
+ * Release tuplesort resources
+ */
+ if (node->tuplesortstate != NULL)
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+
+ /*
+ * shut down the subplan
+ */
+ ExecEndNode(outerPlanState(node));
+
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "sort node shutdown");
+ }
+
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ PlanState *outerPlan = outerPlanState(node);
+
+ /*
+ * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ * re-scan it at all.
+ */
+ if (!node->sort_Done)
+ return;
+
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+ /*
+ * Incremental sort doesn't retain the full sorted result, so we always
+ * forget previous sort results: the subplan must be re-read and
+ * re-sorted.
+ */
+ node->sort_Done = false;
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+ node->bound_Done = 0;
+
+ /*
+ * if chgParam of subnode is not null then plan will be re-scanned by
+ * first ExecProcNode.
+ */
+ if (outerPlan->chgParam == NULL)
+ ExecReScan(outerPlan);
+ }
+
+ /* ----------------------------------------------------------------
+ * Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortEstimate
+ *
+ * Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+ size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+ shm_toc_estimate_chunk(&pcxt->estimator, size);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortInitializeDSM
+ *
+ * Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + pcxt->nworkers * sizeof(IncrementalSortInfo);
+ node->shared_info = shm_toc_allocate(pcxt->toc, size);
+ /* ensure any unfilled slots will contain zeroes */
+ memset(node->shared_info, 0, size);
+ node->shared_info->num_workers = pcxt->nworkers;
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+ node->shared_info);
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ /* If there's any instrumentation space, clear it for next time */
+ if (node->shared_info != NULL)
+ {
+ memset(node->shared_info->sinfo, 0,
+ node->shared_info->num_workers * sizeof(IncrementalSortInfo));
+ }
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortInitializeWorker
+ *
+ * Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortInitializeWorker(IncrementalSortState *node, shm_toc *toc)
+ {
+ node->shared_info =
+ shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id, true);
+ node->am_worker = true;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortRetrieveInstrumentation
+ *
+ * Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+ {
+ Size size;
+ SharedIncrementalSortInfo *si;
+
+ if (node->shared_info == NULL)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+ si = palloc(size);
+ memcpy(si, node->shared_info, size);
+ node->shared_info = si;
+ }
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 98bcaeb..2bddf63
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(PlanState *pstate)
*** 93,99 ****
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
--- 93,100 ----
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess,
! false);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index f1bed14..0082db3
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 917,922 ****
--- 917,940 ----
/*
+ * CopySortFields
+ *
+ * This function copies the fields of the Sort node. It is used by
+ * all the copy functions for classes which inherit from Sort.
+ */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+ COPY_SCALAR_FIELD(numCols);
+ COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+
+ /*
* _copySort
*/
static Sort *
*************** _copySort(const Sort *from)
*** 927,939 ****
/*
* copy node superclass fields
*/
! CopyPlanFields((const Plan *) from, (Plan *) newnode);
! COPY_SCALAR_FIELD(numCols);
! COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
return newnode;
}
--- 945,973 ----
/*
* copy node superclass fields
*/
! CopySortFields(from, newnode);
! return newnode;
! }
!
!
! /*
! * _copyIncrementalSort
! */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! IncrementalSort *newnode = makeNode(IncrementalSort);
!
! /*
! * copy node superclass fields
! */
! CopySortFields((const Sort *) from, (Sort *) newnode);
!
! /*
! * copy remainder of node
! */
! COPY_SCALAR_FIELD(skipCols);
return newnode;
}
*************** copyObjectImpl(const void *from)
*** 4789,4794 ****
--- 4823,4831 ----
case T_Sort:
retval = _copySort(from);
break;
+ case T_IncrementalSort:
+ retval = _copyIncrementalSort(from);
+ break;
case T_Group:
retval = _copyGroup(from);
break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index b83d919..8619847
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 861,872 ****
}
static void
! _outSort(StringInfo str, const Sort *node)
{
int i;
- WRITE_NODE_TYPE("SORT");
-
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
--- 861,870 ----
}
static void
! _outSortInfo(StringInfo str, const Sort *node)
{
int i;
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 889,894 ****
--- 887,910 ----
}
static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ WRITE_NODE_TYPE("SORT");
+
+ _outSortInfo(str, node);
+ }
+
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ WRITE_NODE_TYPE("INCREMENTALSORT");
+
+ _outSortInfo(str, (const Sort *) node);
+
+ WRITE_INT_FIELD(skipCols);
+ }
+
+ static void
_outUnique(StringInfo str, const Unique *node)
{
int i;
*************** outNode(StringInfo str, const void *obj)
*** 3728,3733 ****
--- 3744,3752 ----
case T_Sort:
_outSort(str, obj);
break;
+ case T_IncrementalSort:
+ _outIncrementalSort(str, obj);
+ break;
case T_Unique:
_outUnique(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index fbf8330..5fdba3a
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 2053,2064 ****
}
/*
! * _readSort
*/
! static Sort *
! _readSort(void)
{
! READ_LOCALS(Sort);
ReadCommonPlan(&local_node->plan);
--- 2053,2065 ----
}
/*
! * ReadCommonSort
! * Assign the basic stuff of all nodes that inherit from Sort
*/
! static void
! ReadCommonSort(Sort *local_node)
{
! READ_TEMP_LOCALS();
ReadCommonPlan(&local_node->plan);
*************** _readSort(void)
*** 2067,2072 ****
--- 2068,2099 ----
READ_OID_ARRAY(sortOperators, local_node->numCols);
READ_OID_ARRAY(collations, local_node->numCols);
READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+
+ /*
+ * _readSort
+ */
+ static Sort *
+ _readSort(void)
+ {
+ READ_LOCALS_NO_FIELDS(Sort);
+
+ ReadCommonSort(local_node);
+
+ READ_DONE();
+ }
+
+ /*
+ * _readIncrementalSort
+ */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ READ_LOCALS(IncrementalSort);
+
+ ReadCommonSort(&local_node->sort);
+
+ READ_INT_FIELD(skipCols);
READ_DONE();
}
*************** parseNodeString(void)
*** 2624,2629 ****
--- 2651,2658 ----
return_value = _readMaterial();
else if (MATCH("SORT", 4))
return_value = _readSort();
+ else if (MATCH("INCREMENTALSORT", 15))
+ return_value = _readIncrementalSort();
else if (MATCH("GROUP", 5))
return_value = _readGroup();
else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index 2d7e1d8..010fc2c
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3281,3286 ****
--- 3281,3290 ----
ptype = "Sort";
subpath = ((SortPath *) path)->subpath;
break;
+ case T_IncrementalSortPath:
+ ptype = "IncrementalSort";
+ subpath = ((SortPath *) path)->subpath;
+ break;
case T_GroupPath:
ptype = "Group";
subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 051a854..f779ef9
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
+ bool enable_incrementalsort = true;
bool enable_hashagg = true;
bool enable_nestloop = true;
bool enable_material = true;
*************** cost_recursive_union(Path *runion, Path
*** 1600,1605 ****
--- 1601,1613 ----
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
+ * The sort can be either a full sort of the relation or an incremental sort
+ * when the input is already presorted by some of the required pathkeys. In
+ * the latter case we estimate the number of groups into which the presorted
+ * pathkeys divide the input, then estimate the cost of sorting each group
+ * individually, assuming tuples are distributed uniformly among the groups.
+ * If a LIMIT is specified, we only have to fetch and sort some of the groups.
+ *
* If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path
*** 1626,1632 ****
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
--- 1634,1641 ----
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_startup_cost' is the startup cost for reading the input data
! * 'input_total_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path
*** 1642,1660 ****
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_cost;
! Cost run_cost = 0;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
path->rows = tuples;
--- 1651,1678 ----
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_startup_cost;
! Cost run_cost = 0,
! rest_cost,
! group_cost,
! input_run_cost = input_total_cost - input_startup_cost;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
+ double num_groups,
+ group_input_bytes,
+ group_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
+ if (!enable_incrementalsort)
+ presorted_keys = 0;
path->rows = tuples;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1680,1692 ****
output_bytes = input_bytes;
}
! if (output_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(input_bytes / BLCKSZ);
! double nruns = input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
--- 1698,1747 ----
output_bytes = input_bytes;
}
! /*
! * Estimate the number of groups into which the presorted keys divide the
! * dataset.
! */
! if (presorted_keys > 0)
! {
! List *presortedExprs = NIL;
! ListCell *l;
! int i = 0;
!
! /* Extract presorted keys as list of expressions */
! foreach(l, pathkeys)
! {
! PathKey *key = (PathKey *)lfirst(l);
! EquivalenceMember *member = (EquivalenceMember *)
! lfirst(list_head(key->pk_eclass->ec_members));
!
! presortedExprs = lappend(presortedExprs, member->em_expr);
!
! i++;
! if (i >= presorted_keys)
! break;
! }
!
! /* Estimate number of groups with equal presorted keys */
! num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! }
! else
! {
! num_groups = 1.0;
! }
!
! /*
! * Estimate the average cost of sorting one group in which all presorted
! * keys are equal.
! */
! group_input_bytes = input_bytes / num_groups;
! group_tuples = tuples / num_groups;
! if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(group_input_bytes / BLCKSZ);
! double nruns = group_input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1696,1702 ****
*
* Assume about N log2 N comparisons
*/
! startup_cost += comparison_cost * tuples * LOG2(tuples);
/* Disk costs */
--- 1751,1757 ----
*
* Assume about N log2 N comparisons
*/
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
/* Disk costs */
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1707,1716 ****
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! startup_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1762,1771 ----
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! group_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1718,1731 ****
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
}
else
{
! /* We'll use plain quicksort on all the input tuples */
! startup_cost += comparison_cost * tuples * LOG2(tuples);
}
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
--- 1773,1805 ----
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
}
else
{
! /*
! * We'll use plain quicksort on all the input tuples. If we expect
! * fewer than two tuples per sort group, assume the logarithmic
! * part of the estimate is 1.
! */
! if (group_tuples >= 2.0)
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
! else
! group_cost = comparison_cost * group_tuples;
}
+ /* Add the per-group cost of fetching tuples from the input */
+ group_cost += input_run_cost / num_groups;
+
+ /*
+ * We have to sort the first group before the node can return any tuples.
+ * Sorting the remaining groups is required to return all the other tuples.
+ */
+ startup_cost += group_cost;
+ rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ if (rest_cost > 0.0)
+ run_cost += rest_cost;
+
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1736,1741 ****
--- 1810,1828 ----
*/
run_cost += cpu_operator_cost * tuples;
+ /* Extra costs of incremental sort */
+ if (presorted_keys > 0)
+ {
+ /*
+ * In the incremental sort case we also have to pay the cost of detecting
+ * sort group boundaries: an extra copy and comparison for each tuple.
+ */
+ run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+ /* Cost of the per-group tuplesort reset */
+ run_cost += 2.0 * cpu_tuple_cost * num_groups;
+ }
+
path->startup_cost = startup_cost;
path->total_cost = startup_cost + run_cost;
}
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2499,2504 ****
--- 2586,2593 ----
cost_sort(&sort_path,
root,
outersortkeys,
+ pathkeys_common(outer_path->pathkeys, outersortkeys),
+ outer_path->startup_cost,
outer_path->total_cost,
outer_path_rows,
outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2525,2530 ****
--- 2614,2621 ----
cost_sort(&sort_path,
root,
innersortkeys,
+ pathkeys_common(inner_path->pathkeys, innersortkeys),
+ inner_path->startup_cost,
inner_path->total_cost,
inner_path_rows,
inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 9d83a5c..910f285
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
#include "nodes/nodeFuncs.h"
#include "nodes/plannodes.h"
#include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/tlist.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
return PATHKEYS_EQUAL;
}
+
+ /*
+ * pathkeys_common
+ * Returns the length of the longest common prefix of keys1 and keys2.
+ */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ int n;
+ ListCell *key1,
+ *key2;
+ n = 0;
+
+ forboth(key1, keys1, key2, keys2)
+ {
+ PathKey *pathkey1 = (PathKey *) lfirst(key1);
+ PathKey *pathkey2 = (PathKey *) lfirst(key2);
+
+ if (pathkey1 != pathkey2)
+ return n;
+ n++;
+ }
+
+ return n;
+ }
+
+
/*
* pathkeys_contained_in
* Common special case of compare_pathkeys: we just want to know
*************** get_cheapest_path_for_pathkeys(List *pat
*** 373,380 ****
/*
* get_cheapest_fractional_path_for_pathkeys
* Find the cheapest path (for retrieving a specified fraction of all
! * the tuples) that satisfies the given pathkeys and parameterization.
! * Return NULL if no such path.
*
* See compare_fractional_path_costs() for the interpretation of the fraction
* parameter.
--- 402,413 ----
/*
* get_cheapest_fractional_path_for_pathkeys
* Find the cheapest path (for retrieving a specified fraction of all
! * the tuples) that satisfies the given parameterization and at least
! * partially satisfies the given pathkeys. Return NULL if no such path.
! * If the pathkeys are satisfied only partially, we would have to perform
! * an incremental sort in order to satisfy them completely. Since
! * incremental sort consumes its input by presorted groups, we would have
! * to consume more data than with a fully presorted path.
*
* See compare_fractional_path_costs() for the interpretation of the fraction
* parameter.
*************** right_merge_direction(PlannerInfo *root,
*** 1488,1513 ****
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
! * no good to order by just the first key(s) of the requested ordering.
! * So the result is always either 0 or list_length(root->query_pathkeys).
*/
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
{
! if (root->query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
{
! /* It's useful ... or at least the first N keys are */
! return list_length(root->query_pathkeys);
}
-
- return 0; /* path ordering not useful */
}
/*
--- 1521,1562 ----
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Returns the number of pathkeys that match the given query_pathkeys. The
! * remaining pathkeys can be satisfied by an incremental sort.
*/
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
{
! int n_common_pathkeys;
!
! if (query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
!
! if (enable_incrementalsort)
{
! /*
! * Return the number of pathkeys in common, or 0 if there are none.
! * Any leading common pathkeys are useful for ordering because we can
! * satisfy the remainder with an incremental sort.
! */
! return n_common_pathkeys;
! }
! else
! {
! /*
! * When incremental sort is disabled, pathkeys are useful only if they
! * contain all the query pathkeys.
! */
! if (n_common_pathkeys == list_length(query_pathkeys))
! return n_common_pathkeys;
! else
! return 0;
}
}
/*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1523,1529 ****
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
--- 1572,1578 ----
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 2821662..4c5d14f
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 235,241 ****
Plan *lefttree, Plan *righttree,
JoinType jointype, bool inner_unique,
bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 235,241 ----
Plan *lefttree, Plan *righttree,
JoinType jointype, bool inner_unique,
bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static Plan *prepare_sort_from_pathkeys(
*** 250,259 ****
static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 250,261 ----
static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! int skipCols);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 435,440 ****
--- 437,443 ----
(GatherPath *) best_path);
break;
case T_Sort:
+ case T_IncrementalSort:
plan = (Plan *) create_sort_plan(root,
(SortPath *) best_path,
flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1110,1115 ****
--- 1113,1119 ----
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int n_common_pathkeys;
/* Build the child plan */
/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1144,1152 ****
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
--- 1148,1158 ----
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
+ n_common_pathkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1496,1501 ****
--- 1502,1508 ----
Plan *subplan;
List *pathkeys = best_path->path.pathkeys;
List *tlist = build_path_tlist(root, &best_path->path);
+ int n_common_pathkeys;
/* As with Gather, it's best to project away columns in the workers. */
subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1525,1536 ****
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
--- 1532,1547 ----
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
! {
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ n_common_pathkeys,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
+ }
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1643,1648 ****
--- 1654,1660 ----
{
Sort *plan;
Plan *subplan;
+ int n_common_pathkeys;
/*
* We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1652,1658 ****
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
--- 1664,1674 ----
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! best_path->subpath->pathkeys);
!
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! n_common_pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1896,1902 ****
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
! subplan);
}
if (!rollup->is_hashed)
--- 1912,1919 ----
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
! subplan,
! 0);
}
if (!rollup->is_hashed)
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3834,3841 ****
*/
if (best_path->outersortkeys)
{
! Sort *sort = make_sort_from_pathkeys(outer_plan,
! best_path->outersortkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
--- 3851,3864 ----
*/
if (best_path->outersortkeys)
{
! Sort *sort;
! int n_common_pathkeys;
!
! n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! best_path->jpath.outerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3846,3853 ****
if (best_path->innersortkeys)
{
! Sort *sort = make_sort_from_pathkeys(inner_plan,
! best_path->innersortkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
--- 3869,3882 ----
if (best_path->innersortkeys)
{
! Sort *sort;
! int n_common_pathkeys;
!
! n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! best_path->jpath.innerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4899,4905 ****
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
--- 4928,4935 ----
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL, 0,
! lefttree->startup_cost,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5484,5496 ****
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node = makeNode(Sort);
! Plan *plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
--- 5514,5544 ----
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node;
! Plan *plan;
+ /* Always use a regular Sort node when enable_incrementalsort = false */
+ if (!enable_incrementalsort)
+ skipCols = 0;
+
+ if (skipCols == 0)
+ {
+ node = makeNode(Sort);
+ }
+ else
+ {
+ IncrementalSort *incrementalSort;
+
+ incrementalSort = makeNode(IncrementalSort);
+ node = &incrementalSort->sort;
+ incrementalSort->skipCols = skipCols;
+ }
+
+ plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass
*** 5823,5829 ****
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys)
{
int numsortkeys;
AttrNumber *sortColIdx;
--- 5871,5877 ----
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, int skipCols)
{
int numsortkeys;
AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree,
*** 5843,5849 ****
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5891,5897 ----
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5886,5892 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5934,5940 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, 0,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5907,5913 ****
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
--- 5955,5962 ----
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5940,5946 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5989,5995 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** is_projection_capable_plan(Plan *plan)
*** 6596,6601 ****
--- 6645,6651 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index bba8a1f..eca8561
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
#include "parser/parse_clause.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
#include "utils/syscache.h"
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index 6b79b3a..e239217
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3769,3782 ****
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_partial_path || is_sorted)
{
/* Sort the cheapest partial path, if it isn't already */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3769,3782 ----
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_partial_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest partial path, if it isn't already */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3849,3862 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_path || is_sorted)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3849,3862 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4923,4935 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || is_sorted)
{
! if (!is_sorted)
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
--- 4923,4935 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || n_useful_pathkeys > 0)
{
! if (n_useful_pathkeys < list_length(root->sort_pathkeys))
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid
*** 6058,6065 ****
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL,
! seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
--- 6058,6066 ----
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL, 0,
! seqScanPath->startup_cost, seqScanPath->total_cost,
! rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index b0c9e94..65d44e7
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 634,639 ****
--- 634,640 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 1103984..8278316
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2765,2770 ****
--- 2765,2771 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index ccf2145..e6c5600
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 989,995 ****
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 989,996 ----
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, 0,
! sorted_p.startup_cost, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 26567cb..ef03c21
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
}
/*
! * compare_path_fractional_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
--- 95,101 ----
}
/*
! * compare_fractional_path_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1296,1307 ****
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1296,1308 ----
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ int n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1315,1320 ****
--- 1316,1323 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->parent->tuples,
subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1551,1557 ****
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
--- 1554,1561 ----
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL, 0,
! subpath->startup_cost,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
*************** create_gather_merge_path(PlannerInfo *ro
*** 1643,1648 ****
--- 1647,1653 ----
GatherMergePath *pathnode = makeNode(GatherMergePath);
Cost input_startup_cost = 0;
Cost input_total_cost = 0;
+ int n_common_pathkeys;
Assert(subpath->parallel_safe);
Assert(pathkeys);
*************** create_gather_merge_path(PlannerInfo *ro
*** 1659,1665 ****
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1664,1672 ----
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
!
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_gather_merge_path(PlannerInfo *ro
*** 1673,1678 ****
--- 1680,1687 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2516,2524 ****
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode = makeNode(SortPath);
- pathnode->path.pathtype = T_Sort;
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
--- 2525,2555 ----
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode;
! int n_common_pathkeys;
!
! if (enable_incrementalsort)
! n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! else
! n_common_pathkeys = 0;
!
! if (n_common_pathkeys == 0)
! {
! pathnode = makeNode(SortPath);
! pathnode->path.pathtype = T_Sort;
! }
! else
! {
! IncrementalSortPath *incpathnode;
!
! incpathnode = makeNode(IncrementalSortPath);
! pathnode = &incpathnode->spath;
! pathnode->path.pathtype = T_IncrementalSort;
! incpathnode->skipCols = n_common_pathkeys;
! }
!
! Assert(n_common_pathkeys < list_length(pathkeys));
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2532,2538 ****
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root, pathkeys,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
--- 2563,2571 ----
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root,
! pathkeys, n_common_pathkeys,
! subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2840,2846 ****
else
{
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL,
0.0,
subpath->rows,
subpath->pathtarget->width,
--- 2873,2880 ----
else
{
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL, 0,
! 0.0,
0.0,
subpath->rows,
subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index 25905a3..6d165be
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 277,283 ****
qstate->sortOperators,
qstate->sortCollations,
qstate->sortNullsFirsts,
! work_mem, false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
--- 277,283 ----
qstate->sortOperators,
qstate->sortCollations,
qstate->sortNullsFirsts,
! work_mem, false, false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index db1792b..3cb1ded
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3641,3646 ****
--- 3641,3682 ----
}
/*
+ * estimate_pathkeys_groups - Estimate the number of groups the dataset is
+ * divided into by pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * the first i pathkeys divide the dataset into. This is just a convenience
+ * wrapper over estimate_num_groups().
+ */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ ListCell *l;
+ List *groupExprs = NIL;
+ double *result;
+ int i;
+
+ /*
+ * Get number of groups for each prefix of pathkeys.
+ */
+ i = 0;
+ result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ groupExprs = lappend(groupExprs, member->em_expr);
+
+ result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ i++;
+ }
+
+ return result;
+ }
+
+ /*
* Estimate hash bucket statistics when the specified expression is used
* as a hash key for the given number of buckets.
*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index bc9f09a..f7ab820
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 862,867 ****
--- 862,876 ----
NULL, NULL, NULL
},
{
+ {"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of incremental sort steps."),
+ NULL
+ },
+ &enable_incrementalsort,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of hashed aggregation plans."),
NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 17e1b68..f331d88
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 282,287 ****
--- 282,294 ----
int64 allowedMem; /* total memory allowed, in bytes */
int maxTapes; /* number of tapes (Knuth's T) */
int tapeRange; /* maxTapes-1 (Knuth's P) */
+ int64 maxSpace; /* maximum amount of space occupied among sorts
of groups, either in-memory or on-disk */
+ bool maxSpaceOnDisk; /* true when maxSpace is the value for on-disk
space, false when it's the value for in-memory
space */
+ TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ MemoryContext maincontext;
MemoryContext sortcontext; /* memory context holding most sort data */
MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 636,641 ****
--- 643,651 ----
static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+
/*
* Special versions of qsort just for SortTuple objects. qsort_tuple() sorts
*************** static Tuplesortstate *
*** 670,688 ****
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Create a working memory context for this sort operation. All data
! * needed by the sort will live inside this context.
*/
! sortcontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
--- 680,709 ----
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
+ MemoryContext maincontext;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Memory context surviving tuplesort_reset. This memory context holds
! * data that is useful to keep while sorting multiple similar batches.
*/
! maincontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
+ * Create a working memory context for one sort operation. The content of
+ * this context is deleted by tuplesort_reset.
+ */
+ sortcontext = AllocSetContextCreate(maincontext,
+ "TupleSort sort",
+ ALLOCSET_DEFAULT_SIZES);
+
+ /*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 699,705 ****
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(sortcontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
--- 720,726 ----
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(maincontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
*************** tuplesort_begin_common(int workMem, bool
*** 717,722 ****
--- 738,744 ----
state->availMem = state->allowedMem;
state->sortcontext = sortcontext;
state->tuplecontext = tuplecontext;
+ state->maincontext = maincontext;
state->tapeset = NULL;
state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 757,769 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
AssertArg(nkeys > 0);
--- 779,792 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
AssertArg(nkeys > 0);
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 805,811 ****
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0);
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
--- 828,834 ----
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0) && !skipAbbrev;
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 836,842 ****
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 859,865 ----
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 927,933 ****
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 950,956 ----
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 1002,1008 ****
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 1025,1031 ----
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 1044,1050 ****
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 1067,1073 ----
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1155,1170 ****
}
/*
! * tuplesort_end
! *
! * Release resources and clean up.
*
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
*/
! void
! tuplesort_end(Tuplesortstate *state)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1178,1189 ----
}
/*
! * tuplesort_free
*
! * Internal routine for freeing resources of tuplesort.
*/
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1223,1229 ****
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! MemoryContextDelete(state->sortcontext);
}
/*
--- 1242,1339 ----
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! if (delete)
! {
! MemoryContextDelete(state->maincontext);
! }
! else
! {
! MemoryContextResetOnly(state->sortcontext);
! MemoryContextResetOnly(state->tuplecontext);
! }
! }
!
! /*
! * tuplesort_end
! *
! * Release resources and clean up.
! *
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
! */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! tuplesort_free(state, true);
! }
!
! /*
! * tuplesort_updatemax
! *
! * Update maximum resource usage statistics.
! */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! int64 spaceUsed;
! bool spaceUsedOnDisk;
!
! /*
! * Note: it might seem we should provide both memory and disk usage for a
! * disk-based sort. However, the current code doesn't track memory space
! * accurately once we have begun to return tuples to the caller (since we
! * don't account for pfree's the caller is expected to do), so we cannot
! * rely on availMem in a disk sort. This does not seem worth the overhead
! * to fix. Is it worth creating an API for the memory context code to
! * tell us how much is actually used in sortcontext?
! */
! if (state->tapeset)
! {
! spaceUsedOnDisk = true;
! spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! }
! else
! {
! spaceUsedOnDisk = false;
! spaceUsed = state->allowedMem - state->availMem;
! }
!
! if (spaceUsed > state->maxSpace)
! {
! state->maxSpace = spaceUsed;
! state->maxSpaceOnDisk = spaceUsedOnDisk;
! state->maxSpaceStatus = state->status;
! }
! }
!
! /*
! * tuplesort_reset
! *
! * Reset the tuplesort. Reset all the data in the tuplesort, but leave the
! * meta-information in place. After tuplesort_reset, the tuplesort is ready
! * to start a new sort. This avoids recreating the tuplesort (and thus saves
! * resources) when sorting multiple small batches.
! */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! tuplesort_updatemax(state);
! tuplesort_free(state, false);
! state->status = TSS_INITIAL;
! state->memtupcount = 0;
! state->boundUsed = false;
! state->tapeset = NULL;
! state->currentRun = 0;
! state->result_tape = -1;
! state->bounded = false;
! state->availMem = state->allowedMem;
! state->lastReturnedTuple = NULL;
! state->slabAllocatorUsed = false;
! state->slabMemoryBegin = NULL;
! state->slabMemoryEnd = NULL;
! state->slabFreeHead = NULL;
! USEMEM(state, GetMemoryChunkSpace(state->memtuples));
}
/*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 3241,3258 ****
* to fix. Is it worth creating an API for the memory context code to
* tell us how much is actually used in sortcontext?
*/
! if (state->tapeset)
! {
stats->spaceType = SORT_SPACE_TYPE_DISK;
- stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- }
else
- {
stats->spaceType = SORT_SPACE_TYPE_MEMORY;
! stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! }
! switch (state->status)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
--- 3351,3365 ----
* to fix. Is it worth creating an API for the memory context code to
* tell us how much is actually used in sortcontext?
*/
! tuplesort_updatemax(state);
!
! if (state->maxSpaceOnDisk)
stats->spaceType = SORT_SPACE_TYPE_DISK;
else
stats->spaceType = SORT_SPACE_TYPE_MEMORY;
! stats->spaceUsed = (state->maxSpace + 1023) / 1024;
! switch (state->maxSpaceStatus)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...cfe944f
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,31 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+
+ #include "access/parallel.h"
+ #include "nodes/execnodes.h"
+
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+ /* parallel instrumentation support */
+ extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, shm_toc *toc);
+ extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+ #endif /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index 90a60ab..c21113a
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1730,1735 ****
--- 1730,1749 ----
Tuplestorestate *tuplestorestate;
} MaterialState;
+
+ /* ----------------
+ * When sorting by multiple keys, the input dataset could already be
+ * presorted by some prefix of these keys. We call them "skip keys".
+ * SkipKeyData represents information about one such key.
+ * ----------------
+ */
+ typedef struct SkipKeyData
+ {
+ FmgrInfo flinfo; /* comparison function info */
+ FunctionCallInfoData fcinfo; /* comparison function call info */
+ OffsetNumber attno; /* attribute number in tuple */
+ } SkipKeyData;
+
/* ----------------
* Shared memory container for per-worker sort information
* ----------------
*************** typedef struct SortState
*** 1758,1763 ****
--- 1772,1815 ----
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
+ /* ----------------
+ * Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+ typedef struct IncrementalSortInfo
+ {
+ TuplesortInstrumentation sinstrument;
+ int64 groupsCount;
+ } IncrementalSortInfo;
+
+ typedef struct SharedIncrementalSortInfo
+ {
+ int num_workers;
+ IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+ } SharedIncrementalSortInfo;
+
+ /* ----------------
+ * IncrementalSortState information
+ * ----------------
+ */
+ typedef struct IncrementalSortState
+ {
+ ScanState ss; /* its first field is NodeTag */
+ bool bounded; /* is the result set bounded? */
+ int64 bound; /* if bounded, how many tuples are needed */
+ bool sort_Done; /* sort completed yet? */
+ bool finished; /* is fetching tuples from the outer
node finished? */
+ bool bounded_Done; /* value of bounded we did the sort with */
+ int64 bound_Done; /* value of bound we did the sort with */
+ void *tuplesortstate; /* private state of tuplesort.c */
+ SkipKeyData *skipKeys; /* keys the dataset is presorted by */
+ int64 groupsCount; /* number of groups with equal skip keys */
+ TupleTableSlot *sampleSlot; /* slot for sample tuple of sort group */
+ bool am_worker; /* are we a worker? */
+ SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+ } IncrementalSortState;
+
/* ---------------------
* GroupState information
* ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index 27bd4f3..ae772e8
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 73,78 ****
--- 73,79 ----
T_HashJoin,
T_Material,
T_Sort,
+ T_IncrementalSort,
T_Group,
T_Agg,
T_WindowAgg,
*************** typedef enum NodeTag
*** 125,130 ****
--- 126,132 ----
T_HashJoinState,
T_MaterialState,
T_SortState,
+ T_IncrementalSortState,
T_GroupState,
T_AggState,
T_WindowAggState,
*************** typedef enum NodeTag
*** 240,245 ****
--- 242,248 ----
T_ProjectionPath,
T_ProjectSetPath,
T_SortPath,
+ T_IncrementalSortPath,
T_GroupPath,
T_UpperUniquePath,
T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index a382331..c592183
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 749,754 ****
--- 749,765 ----
bool *nullsFirst; /* NULLS FIRST/LAST directions */
} Sort;
+
+ /* ----------------
+ * incremental sort node
+ * ----------------
+ */
+ typedef struct IncrementalSort
+ {
+ Sort sort;
+ int skipCols; /* number of presorted columns */
+ } IncrementalSort;
+
/* ---------------
* group node -
* Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index a39e59d..5a17189
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1419,1424 ****
--- 1419,1434 ----
} SortPath;
/*
+ * IncrementalSortPath
+ */
+ typedef struct IncrementalSortPath
+ {
+ SortPath spath;
+ int skipCols;
+ } IncrementalSortPath;
+
+
+ /*
* GroupPath represents grouping (of presorted input)
*
* groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 63feba0..04553d1
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
extern bool enable_bitmapscan;
extern bool enable_tidscan;
extern bool enable_sort;
+ extern bool enable_incrementalsort;
extern bool enable_hashagg;
extern bool enable_nestloop;
extern bool enable_material;
*************** extern void cost_namedtuplestorescan(Pat
*** 102,109 ****
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
--- 103,111 ----
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 4e06b2e..4f2fe81
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 182,187 ****
--- 182,188 ----
extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
Relids required_outer,
CostSelector cost_criterion,
*************** extern List *select_outer_pathkeys_for_m
*** 220,225 ****
--- 221,227 ----
extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
List *mergeclauses,
List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
extern List *truncate_useless_pathkeys(PlannerInfo *root,
RelOptInfo *rel,
List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 199a631..41b7196
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 206,211 ****
--- 206,214 ----
extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows, List **pgset);
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ double tuples);
+
extern void estimate_hash_bucket_stats(PlannerInfo *root,
Node *hashkey, double nbuckets,
Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index b6b8c8e..938d329
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 90,96 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
--- 90,97 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 134,139 ****
--- 135,142 ----
extern void tuplesort_end(Tuplesortstate *state);
+ extern void tuplesort_reset(Tuplesortstate *state);
+
extern void tuplesort_get_stats(Tuplesortstate *state,
TuplesortInstrumentation *stats);
extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort
*** 19,27 ****
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Sort
Sort Key: id, data
! -> Seq Scan on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
--- 19,28 ----
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Incremental Sort
Sort Key: id, data
! Presorted Key: id
! -> Index Scan using test_dc_pkey on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index 1fa9650..1883170
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE: drop cascades to table matest1
*** 1493,1498 ****
--- 1493,1499 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1633,1641 ****
--- 1634,1678 ----
{3,7,8,10,13,13,16,18,19,22}
(3 rows)
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ QUERY PLAN
+ -------------------------------------------------------------------------
+ Merge Append
+ Sort Key: tenk1.thousand, tenk1.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1
+ -> Incremental Sort
+ Sort Key: tenk1_1.thousand, tenk1_1.thousand
+ Presorted Key: tenk1_1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ QUERY PLAN
+ -------------------------------------------------------------
+ Merge Append
+ Sort Key: a.thousand, a.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 a
+ -> Incremental Sort
+ Sort Key: b.unique2, b.unique2
+ Presorted Key: b.unique2
+ -> Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index 568b783..e60fb43
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select count(*) >= 0 as ok from pg_prepa
*** 70,90 ****
-- This is to record the prevailing planner enable_foo settings during
-- a regression test run.
select name, setting from pg_settings where name like 'enable%';
! name | setting
! ----------------------+---------
! enable_bitmapscan | on
! enable_gathermerge | on
! enable_hashagg | on
! enable_hashjoin | on
! enable_indexonlyscan | on
! enable_indexscan | on
! enable_material | on
! enable_mergejoin | on
! enable_nestloop | on
! enable_seqscan | on
! enable_sort | on
! enable_tidscan | on
! (12 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
--- 70,91 ----
-- This is to record the prevailing planner enable_foo settings during
-- a regression test run.
select name, setting from pg_settings where name like 'enable%';
! name | setting
! ------------------------+---------
! enable_bitmapscan | on
! enable_gathermerge | on
! enable_hashagg | on
! enable_hashjoin | on
! enable_incrementalsort | on
! enable_indexonlyscan | on
! enable_indexscan | on
! enable_material | on
! enable_mergejoin | on
! enable_nestloop | on
! enable_seqscan | on
! enable_sort | on
! enable_tidscan | on
! (13 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index c96580c..b389c63
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 527,532 ****
--- 527,533 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
*************** SELECT
*** 588,596 ****
--- 589,614 ----
ORDER BY f.i LIMIT 10)
FROM generate_series(1, 3) g(i);
+ set enable_incrementalsort = on;
+
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
On Thu, Sep 14, 2017 at 2:48 AM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:
Patch rebased to current master is attached. I'm going to improve my
testing script and post new results.
New benchmarking script and results are attached. A new dataset parameter
is introduced: the skew factor, which defines the skew in the distribution
of group sizes.
My idea for generation is simply to use a power function whose power is
between 0 and 1. The following formula gives the group number for a
particular item number i.
[((i / number_of_indexes) ^ power) * number_of_groups]
For example, power = 1/6 gives the following distribution of group sizes:
group number    group size
0                        2
1                       63
2                      665
3                     3367
4                    11529
5                    31031
6                    70993
7                   144495
8                   269297
9                   468558
For convenience, instead of the power itself, I use a skew factor where
power = 1.0 / (1.0 + skew). Therefore, with skew = 0.0, the distribution of
group sizes is uniform. A larger skew gives a more skewed distribution
(which seems quite intuitive). For a negative skew, group sizes are
mirrored relative to the corresponding positive skew. For example, skew
factor = -5.0 gives the following distribution of group sizes:
group number    group size
0                   468558
1                   269297
2                   144495
3                    70993
4                    31031
5                    11529
6                     3367
7                      665
8                       63
9                        2
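The generation scheme described above can be sketched in Python. This is my reading of the description, not the author's actual benchmarking script: in particular I read number_of_indexes in the formula as the total number of items, and the function name and parameters are illustrative (Python 3 division semantics assumed):

```python
from collections import Counter

def group_sizes(num_items, num_groups, skew):
    """Group-size distribution for the power-function scheme sketched above.

    power = 1.0 / (1.0 + skew): skew = 0.0 yields a uniform distribution,
    and a larger skew yields a more skewed one.  A negative skew mirrors
    the sizes of the corresponding positive skew.
    """
    power = 1.0 / (1.0 + abs(skew))
    # Group number for item i: [((i / num_items) ^ power) * num_groups]
    counts = Counter(int((i / num_items) ** power * num_groups)
                     for i in range(num_items))
    sizes = [counts[g] for g in sorted(counts)]
    return sizes[::-1] if skew < 0 else sizes

# power = 1/6, matching the skew factor 5.0 case discussed above
sizes = group_sizes(1_000_000, 10, 5.0)
```

With skew = 5.0 this reproduces the heavily skewed shape shown above (sizes growing from a couple of items in group 0 to roughly 470k in the last group); skew = -5.0 mirrors it, and skew = 0.0 gives equal-sized groups.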
Results show that among 2172 test cases, incremental sort gives a speedup
in 2113, while in 59 it causes a slowdown. The following 4 test cases show
the most significant slowdowns (>10% of time).
Table                   GroupedCols GroupCount Skew PreorderedFrac FullSortMedian IncSortMedian TimeChangePercent
int4|int4|numeric                 1        100  -10              0   1.5688240528  2.0607631207             31.36
text|int8|text|int4               1          1    0              0   1.7785198689  2.1816160679             22.66
int8|int8|int4                    1         10  -10              0   1.136412859   1.3166360855             15.86
numeric|text|int4|int8            2         10  -10              1   0.4403841496  0.5070910454             15.15
As you can see, 3 of these 4 test cases have a skewed distribution, while
one of them involves the costly locale-aware comparison of text. I have no
particular idea of how to cope with these slowdowns. Probably it's OK to
have a slowdown in some cases while having a speedup in the majority of
cases (assuming there is an option to turn off the new behavior).
Alternatively, we could teach the optimizer more about skewed distributions
of groups, but that doesn't seem feasible to me.
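For context, the behavior under test can be modeled as follows. This is a simplified Python sketch of the general technique, not the patch's C implementation, and all names are illustrative: the input already arrives ordered by a prefix of the sort key (the "Presorted Key"), so tuples are buffered and sorted one prefix-key group at a time, which is why the group-size distribution drives the cost:

```python
from itertools import groupby
from operator import itemgetter

def incremental_sort(rows, presorted_key, full_key):
    """Sort rows that already arrive ordered by presorted_key.

    Tuples are buffered one presorted-key group at a time, so memory use
    and per-sort cost are driven by the largest group -- which is why
    skewed group-size distributions are the interesting benchmark case.
    """
    for _, group in groupby(rows, key=presorted_key):  # needs presorted input
        yield from sorted(group, key=full_key)

# Input is already sorted on the first column (the presorted prefix).
rows = [(1, 'b'), (1, 'a'), (2, 'c'), (2, 'a'), (3, 'z')]
result = list(incremental_sort(rows, itemgetter(0), itemgetter(0, 1)))
# result is fully sorted by (first, second) column
```

Each group is sorted and emitted before the next one is read, so output starts streaming after the first group instead of after a full sort of the whole input.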
Any thoughts?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Sat, Sep 16, 2017 at 2:46 AM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:
BTW, replacement selection sort was removed by 8b304b8b. I think it's worth
rerunning the benchmarks after that, because the results might change. Will
do.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Sat, Sep 30, 2017 at 11:20 PM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:
I've applied the patch on top of c12d570f and rerun the same benchmarks.
A CSV file with the results is attached. There are no dramatic changes:
there is still a minority of performance regression cases, while the
majority of cases show improvement.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
On Mon, Oct 2, 2017 at 12:37 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
I've applied patch on top of c12d570f and rerun the same benchmarks.
CSV-file with results is attached. There is no dramatical changes. There
is still minority of performance regression cases while majority of cases
has improvement.
Yes, I think these results look pretty good. But are these times in
seconds? You might need to do some testing with bigger sorts.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Oct 3, 2017 at 2:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Oct 2, 2017 at 12:37 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
I've applied patch on top of c12d570f and rerun the same benchmarks.
CSV-file with results is attached. There is no dramatical changes. There
is still minority of performance regression cases while majority of cases
has improvement.
Yes, I think these results look pretty good. But are these times in
seconds? You might need to do some testing with bigger sorts.
Good point. I'll rerun the benchmarks with a larger dataset size.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Patch rebased to current master is attached. I'm going to improve my testing script and post new results.
I wanted to review this patch but incremental-sort-8.patch fails to apply. Can
you please rebase it again?
--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
On Tue, Nov 14, 2017 at 7:00 PM, Antonin Houska <ah@cybertec.at> wrote:
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Patch rebased to current master is attached. I'm going to improve my
testing script and post new results.
I wanted to review this patch but incremental-sort-8.patch fails to apply.
Can you please rebase it again?
Sure, please find rebased patch attached.
Also, I'd like to share partial results of the benchmarks with 100M rows.
It appears that with 100M rows they take quite an amount of time. Perhaps
in cases where there was a degradation with 1M rows, it becomes somewhat
bigger...
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
incremental-sort-9.patchapplication/octet-stream; name=incremental-sort-9.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index 4339bbf..df72ab1
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1981,2019 ****
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ---------------------------------------------------------------------
Limit
! Output: t1.c1, t2.c1
-> Sort
! Output: t1.c1, t2.c1
! Sort Key: t1.c1, t2.c1
-> Nested Loop
! Output: t1.c1, t2.c1
-> Foreign Scan on public.ft1 t1
! Output: t1.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-> Materialize
! Output: t2.c1
-> Foreign Scan on public.ft2 t2
! Output: t2.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! c1 | c1
! ----+-----
! 1 | 101
! 1 | 102
! 1 | 103
! 1 | 104
! 1 | 105
! 1 | 106
! 1 | 107
! 1 | 108
! 1 | 109
! 1 | 110
(10 rows)
-- different server, not pushed down. No result expected.
--- 1981,2019 ----
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ------------------------------------------------------------------
Limit
! Output: t1.c3, t2.c3
-> Sort
! Output: t1.c3, t2.c3
! Sort Key: t1.c3, t2.c3
-> Nested Loop
! Output: t1.c3, t2.c3
-> Foreign Scan on public.ft1 t1
! Output: t1.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
-> Materialize
! Output: t2.c3
-> Foreign Scan on public.ft2 t2
! Output: t2.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! c3 | c3
! -------+-------
! 00001 | 00101
! 00001 | 00102
! 00001 | 00103
! 00001 | 00104
! 00001 | 00105
! 00001 | 00106
! 00001 | 00107
! 00001 | 00108
! 00001 | 00109
! 00001 | 00110
(10 rows)
-- different server, not pushed down. No result expected.
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index ddfec79..c8c6fb7
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 510,517 ****
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 510,517 ----
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index d360fc4..1e878bf
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3552,3557 ****
--- 3552,3571 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+ <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of incremental sort
+ steps. The default is <literal>on</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
<term><varname>enable_indexscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 8f7062c..b46dc17
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual,
*** 80,85 ****
--- 80,87 ----
ExplainState *es);
static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 93,99 ****
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 95,101 ----
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 101,106 ****
--- 103,110 ----
static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 1010,1015 ****
--- 1014,1022 ----
case T_Sort:
pname = sname = "Sort";
break;
+ case T_IncrementalSort:
+ pname = sname = "Incremental Sort";
+ break;
case T_Group:
pname = sname = "Group";
break;
*************** ExplainNode(PlanState *planstate, List *
*** 1600,1605 ****
--- 1607,1618 ----
show_sort_keys(castNode(SortState, planstate), ancestors, es);
show_sort_info(castNode(SortState, planstate), es);
break;
+ case T_IncrementalSort:
+ show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ ancestors, es);
+ show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ es);
+ break;
case T_MergeAppend:
show_merge_append_keys(castNode(MergeAppendState, planstate),
ancestors, es);
*************** static void
*** 1925,1939 ****
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
* Likewise, for a MergeAppend node.
*/
static void
--- 1938,1975 ----
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
+ int skipCols;
+
+ if (IsA(plan, IncrementalSort))
+ skipCols = ((IncrementalSort *) plan)->skipCols;
+ else
+ skipCols = 0;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, skipCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es)
+ {
+ IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+ show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ plan->sort.numCols, plan->skipCols,
+ plan->sort.sortColIdx,
+ plan->sort.sortOperators, plan->sort.collations,
+ plan->sort.nullsFirst,
+ ancestors, es);
+ }
+
+ /*
* Likewise, for a MergeAppend node.
*/
static void
*************** show_merge_append_keys(MergeAppendState
*** 1943,1949 ****
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
--- 1979,1985 ----
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, 0, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1967,1973 ****
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
--- 2003,2009 ----
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
*************** show_grouping_set_keys(PlanState *planst
*** 2036,2042 ****
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
--- 2072,2078 ----
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, 0, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 2093,2099 ****
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
--- 2129,2135 ----
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2106,2118 ****
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
--- 2142,2155 ----
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
+ List *resultPresorted = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2152,2160 ****
--- 2189,2201 ----
nullsFirst[keyno]);
/* Emit one property-list item per sort key */
result = lappend(result, pstrdup(sortkeybuf.data));
+ if (keyno < nPresortedKeys)
+ resultPresorted = lappend(resultPresorted, exprstr);
}
ExplainPropertyList(qlabel, result, es);
+ if (nPresortedKeys > 0)
+ ExplainPropertyList("Presorted Key", resultPresorted, es);
}
/*
*************** show_sort_info(SortState *sortstate, Exp
*** 2363,2368 ****
--- 2404,2498 ----
}
/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for a incremental sort node
+ */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es)
+ {
+ if (es->analyze && incrsortstate->sort_Done &&
+ incrsortstate->tuplesortstate != NULL)
+ {
+ Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ TuplesortInstrumentation stats;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+
+ tuplesort_get_stats(state, &stats);
+ sortMethod = tuplesort_method_name(stats.sortMethod);
+ spaceType = tuplesort_space_type_name(stats.spaceType);
+ spaceUsed = stats.spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",
+ sortMethod, spaceType, spaceUsed);
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Groups: %ld\n",
+ incrsortstate->groupsCount);
+ }
+ else
+ {
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups",
+ incrsortstate->groupsCount, es);
+ }
+ }
+
+ if (incrsortstate->shared_info != NULL)
+ {
+ int n;
+ bool opened_group = false;
+
+ for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+ {
+ TuplesortInstrumentation *sinstrument;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+ int64 groupsCount;
+
+ sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+ groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+ if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+ continue; /* ignore any unfilled slots */
+ sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+ spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+ spaceUsed = sinstrument->spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str,
+ "Worker %d: Sort Method: %s %s: %ldkB Groups: %ld\n",
+ n, sortMethod, spaceType, spaceUsed, groupsCount);
+ }
+ else
+ {
+ if (!opened_group)
+ {
+ ExplainOpenGroup("Workers", "Workers", false, es);
+ opened_group = true;
+ }
+ ExplainOpenGroup("Worker", NULL, true, es);
+ ExplainPropertyInteger("Worker Number", n, es);
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups", groupsCount, es);
+ ExplainCloseGroup("Worker", NULL, true, es);
+ }
+ }
+ if (opened_group)
+ ExplainCloseGroup("Workers", "Workers", false, es);
+ }
+ }
+
+ /*
* Show information on hash buckets/batches.
*/
static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index 083b20f..b093618
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execExpr.
*** 24,31 ****
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
! nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
--- 24,31 ----
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
! nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index f1636a5..dd8cffe
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 31,36 ****
--- 31,37 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 253,258 ****
--- 254,263 ----
ExecReScanSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecReScanIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecReScanGroup((GroupState *) node);
break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 525,532 ****
--- 530,541 ----
case T_CteScan:
case T_Material:
case T_Sort:
+ /* these don't evaluate tlist */
return true;
+ case T_IncrementalSort:
+ return false;
+
case T_LockRows:
case T_Limit:
return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
new file mode 100644
index fd7e7cb..74c1da9
*** a/src/backend/executor/execParallel.c
--- b/src/backend/executor/execParallel.c
***************
*** 28,33 ****
--- 28,34 ----
#include "executor/nodeBitmapHeapscan.h"
#include "executor/nodeCustom.h"
#include "executor/nodeForeignscan.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeSeqscan.h"
*************** ExecParallelEstimate(PlanState *planstat
*** 258,263 ****
--- 259,268 ----
/* even when not parallel-aware */
ExecSortEstimate((SortState *) planstate, e->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+ break;
default:
break;
*************** ExecParallelInitializeDSM(PlanState *pla
*** 330,335 ****
--- 335,344 ----
/* even when not parallel-aware */
ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+ break;
default:
break;
*************** ExecParallelReInitializeDSM(PlanState *p
*** 703,708 ****
--- 712,721 ----
/* even when not parallel-aware */
ExecSortReInitializeDSM((SortState *) planstate, pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortReInitializeDSM((IncrementalSortState *) planstate, pcxt);
+ break;
default:
break;
*************** ExecParallelRetrieveInstrumentation(Plan
*** 761,766 ****
--- 774,781 ----
*/
if (IsA(planstate, SortState))
ExecSortRetrieveInstrumentation((SortState *) planstate);
+ else if (IsA(planstate, IncrementalSortState))
+ ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
return planstate_tree_walker(planstate, ExecParallelRetrieveInstrumentation,
instrumentation);
*************** ExecParallelInitializeWorker(PlanState *
*** 982,987 ****
--- 997,1006 ----
/* even when not parallel-aware */
ExecSortInitializeWorker((SortState *) planstate, toc);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate, toc);
+ break;
default:
break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index c1aa506..e4225df
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 88,93 ****
--- 88,94 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 314,319 ****
--- 315,325 ----
estate, eflags);
break;
+ case T_IncrementalSort:
+ result = (PlanState *) ExecInitIncrementalSort(
+ (IncrementalSort *) node, estate, eflags);
+ break;
+
case T_Group:
result = (PlanState *) ExecInitGroup((Group *) node,
estate, eflags);
*************** ExecEndNode(PlanState *node)
*** 679,684 ****
--- 685,694 ----
ExecEndSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecEndIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecEndGroup((GroupState *) node);
break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index d26ce08..3c37bda
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 666,671 ****
--- 666,672 ----
sortnode->collations,
sortnode->nullsFirst,
work_mem,
+ false,
false);
}
*************** initialize_aggregate(AggState *aggstate,
*** 753,759 ****
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false);
}
/*
--- 754,760 ----
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false, false);
}
/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...04059cc
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,644 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ * Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ * Incremental sort is an optimized variant of multikey sort for cases
+ * when the input is already presorted by a prefix of the required keys
+ * list. Thus, when it's required to sort by (key1, key2 ... keyN) and
+ * the input is already sorted by (key1, key2 ... keyM), M < N, we can
+ * individually sort each group of tuples in which the values of
+ * (key1, key2 ... keyM) are equal.
+ *
+ * Consider the following example. We have input tuples consisting of
+ * two integers (x, y), already presorted by x, while it's required to
+ * sort them by both x and y. Let the input tuples be the following.
+ *
+ * (1, 5)
+ * (1, 2)
+ * (2, 10)
+ * (2, 1)
+ * (2, 5)
+ * (3, 3)
+ * (3, 7)
+ *
+ * The incremental sort algorithm would individually sort the following
+ * groups, each having an equal value of x:
+ * (1, 5) (1, 2)
+ * (2, 10) (2, 1) (2, 5)
+ * (3, 3) (3, 7)
+ *
+ * After sorting these groups and concatenating them together, we get
+ * the following tuple set, which is sorted by both x and y.
+ *
+ * (1, 2)
+ * (1, 5)
+ * (2, 1)
+ * (2, 5)
+ * (2, 10)
+ * (3, 3)
+ * (3, 7)
+ *
+ * Incremental sort is faster than a full sort on large datasets. But
+ * the biggest benefit comes for queries with a LIMIT clause, because
+ * incremental sort can return the first tuples without reading the
+ * whole input dataset.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+
+ /*
+ * Check if first "skipCols" sort values are equal.
+ */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ TupleTableSlot *b)
+ {
+ int n, i;
+
+ Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+ n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+ for (i = 0; i < n; i++)
+ {
+ Datum datumA, datumB, result;
+ bool isnullA, isnullB;
+ AttrNumber attno = node->skipKeys[i].attno;
+ SkipKeyData *key;
+
+ datumA = slot_getattr(a, attno, &isnullA);
+ datumB = slot_getattr(b, attno, &isnullB);
+
+ /* Special case for NULL-vs-NULL, else use standard comparison */
+ if (isnullA || isnullB)
+ {
+ if (isnullA == isnullB)
+ continue;
+ else
+ return false;
+ }
+
+ key = &node->skipKeys[i];
+
+ key->fcinfo.arg[0] = datumA;
+ key->fcinfo.arg[1] = datumB;
+
+ /* just for paranoia's sake, we reset isnull each time */
+ key->fcinfo.isnull = false;
+
+ result = FunctionCallInvoke(&key->fcinfo);
+
+ /* Check for null result, since caller is clearly not expecting one */
+ if (key->fcinfo.isnull)
+ elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+ if (!DatumGetBool(result))
+ return false;
+ }
+ return true;
+ }
+
+ /*
+ * Prepare information for skipKeys comparison.
+ */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ int skipCols,
+ i;
+
+ Assert(IsA(plannode, IncrementalSort));
+ skipCols = plannode->skipCols;
+
+ node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+ for (i = 0; i < skipCols; i++)
+ {
+ Oid equalityOp, equalityFunc;
+ SkipKeyData *key;
+
+ key = &node->skipKeys[i];
+ key->attno = plannode->sort.sortColIdx[i];
+
+ equalityOp = get_equality_op_for_ordering_op(
+ plannode->sort.sortOperators[i], NULL);
+ if (!OidIsValid(equalityOp))
+ elog(ERROR, "missing equality operator for ordering operator %u",
+ plannode->sort.sortOperators[i]);
+
+ equalityFunc = get_opcode(equalityOp);
+ if (!OidIsValid(equalityFunc))
+ elog(ERROR, "missing function for operator %u", equalityOp);
+
+ /* Lookup the comparison function */
+ fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+ /* We can initialize the callinfo just once and re-use it */
+ InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ plannode->sort.collations[i], NULL, NULL);
+ key->fcinfo.argnull[0] = false;
+ key->fcinfo.argnull[1] = false;
+ }
+ }
+
+
+ #define MIN_GROUP_SIZE 32
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSort
+ *
+ * Assuming that the outer subtree returns tuples presorted by some
+ * prefix of the target sort columns, performs an incremental sort.
+ * It fetches groups of tuples in which the prefix sort columns are
+ * equal and sorts them using tuplesort. This approach avoids sorting
+ * the whole dataset at once. Besides taking less memory and being
+ * faster, it allows us to start returning tuples before fetching the
+ * full dataset from the outer subtree.
+ *
+ * Conditions:
+ * -- none.
+ *
+ * Initial States:
+ * -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+ static TupleTableSlot *
+ ExecIncrementalSort(PlanState *pstate)
+ {
+ IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+ EState *estate;
+ ScanDirection dir;
+ Tuplesortstate *tuplesortstate;
+ TupleTableSlot *slot;
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ PlanState *outerNode;
+ TupleDesc tupDesc;
+ int64 nTuples = 0;
+
+ /*
+ * get state info from node
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "entering routine");
+
+ estate = node->ss.ps.state;
+ dir = estate->es_direction;
+ tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+ /*
+ * Return next tuple from sorted set if any.
+ */
+ if (node->sort_Done)
+ {
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ if (tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL) || node->finished)
+ return slot;
+ }
+
+ /*
+ * If first time through, read all tuples from outer plan and pass them to
+ * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
+ */
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "sorting subplan");
+
+ /*
+ * Want to scan subplan in the forward direction while creating the
+ * sorted data.
+ */
+ estate->es_direction = ForwardScanDirection;
+
+ /*
+ * Initialize tuplesort module.
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "calling tuplesort_begin");
+
+ outerNode = outerPlanState(node);
+ tupDesc = ExecGetResultType(outerNode);
+
+ if (node->tuplesortstate == NULL)
+ {
+ /*
+ * We are going to process the first group of presorted data.
+ * Initialize the support structures for cmpSortSkipCols, which
+ * compares the already-sorted (skip) columns.
+ */
+ prepareSkipCols(node);
+
+ /*
+ * Pass all the columns to tuplesort. We pass groups of at least
+ * MIN_GROUP_SIZE tuples to tuplesort, so these groups don't
+ * necessarily have equal values of the prefix columns. Since
+ * incremental sort is unlikely to see huge groups, using
+ * abbreviated keys would likely be a waste of time.
+ */
+ tuplesortstate = tuplesort_begin_heap(
+ tupDesc,
+ plannode->sort.numCols,
+ plannode->sort.sortColIdx,
+ plannode->sort.sortOperators,
+ plannode->sort.collations,
+ plannode->sort.nullsFirst,
+ work_mem,
+ false,
+ true);
+ node->tuplesortstate = (void *) tuplesortstate;
+ node->groupsCount++;
+ }
+ else
+ {
+ /* Next group of presorted data */
+ tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ node->groupsCount++;
+ }
+
+ /* Calculate remaining bound for bounded sort */
+ if (node->bounded)
+ tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+ /* Put the saved tuple, if any, into the tuplesort */
+ if (!TupIsNull(node->sampleSlot))
+ {
+ tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+ ExecClearTuple(node->sampleSlot);
+ nTuples++;
+ }
+
+ /*
+ * Put the next group of tuples, whose skipCols sort values are all
+ * equal, into the tuplesort.
+ */
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+
+ /* Put next group of presorted data to the tuplesort */
+ if (nTuples < MIN_GROUP_SIZE)
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+
+ /* Save last tuple in minimal group */
+ if (nTuples == MIN_GROUP_SIZE - 1)
+ ExecCopySlot(node->sampleSlot, slot);
+ nTuples++;
+ }
+ else
+ {
+ /* Iterate while the skip columns are the same as in the saved tuple */
+ bool cmp;
+ cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+
+ if (cmp)
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ nTuples++;
+ }
+ else
+ {
+ ExecCopySlot(node->sampleSlot, slot);
+ break;
+ }
+ }
+ }
+
+ /*
+ * Complete the sort.
+ */
+ tuplesort_performsort(tuplesortstate);
+
+ /*
+ * restore to user specified direction
+ */
+ estate->es_direction = dir;
+
+ /*
+ * finally set the sorted flag to true
+ */
+ node->sort_Done = true;
+ node->bounded_Done = node->bounded;
+ if (node->shared_info && node->am_worker)
+ {
+ TuplesortInstrumentation *si;
+
+ Assert(IsParallelWorker());
+ Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+ si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+ tuplesort_get_stats(tuplesortstate, si);
+ node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+ node->groupsCount;
+ }
+
+ /*
+ * Adjust bound_Done with number of tuples we've actually sorted.
+ */
+ if (node->bounded)
+ {
+ if (node->finished)
+ node->bound_Done = node->bound;
+ else
+ node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ }
+
+ SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "retrieving tuple from tuplesort");
+
+ /*
+ * Get the first or next tuple from tuplesort. Returns NULL if no more
+ * tuples.
+ */
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+ return slot;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecInitIncrementalSort
+ *
+ * Creates the run-time state information for the sort node
+ * produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ IncrementalSortState *incrsortstate;
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "initializing sort node");
+
+ /*
+ * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+ * current group in the tuplesortstate.
+ */
+ Assert((eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) == 0);
+
+ /*
+ * create state structure
+ */
+ incrsortstate = makeNode(IncrementalSortState);
+ incrsortstate->ss.ps.plan = (Plan *) node;
+ incrsortstate->ss.ps.state = estate;
+ incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+ incrsortstate->bounded = false;
+ incrsortstate->sort_Done = false;
+ incrsortstate->finished = false;
+ incrsortstate->tuplesortstate = NULL;
+ incrsortstate->sampleSlot = NULL;
+ incrsortstate->bound_Done = 0;
+ incrsortstate->groupsCount = 0;
+ incrsortstate->skipKeys = NULL;
+
+ /*
+ * Miscellaneous initialization
+ *
+ * Sort nodes don't initialize their ExprContexts because they never call
+ * ExecQual or ExecProject.
+ */
+
+ /*
+ * tuple table initialization
+ *
+ * sort nodes only return scan tuples from their sorted relation.
+ */
+ ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+
+ /*
+ * initialize child nodes
+ *
+ * We shield the child node from the need to support REWIND, BACKWARD, or
+ * MARK/RESTORE.
+ */
+ eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+ outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+ /*
+ * initialize tuple type. no need to initialize projection info because
+ * this node doesn't do projections.
+ */
+ ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+ /* make standalone slot to store previous tuple from outer node */
+ incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+ ExecGetResultType(outerPlanState(incrsortstate)));
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "sort node initialized");
+
+ return incrsortstate;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "shutting down sort node");
+
+ /*
+ * clean out the tuple table
+ */
+ ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ /* must drop standalone tuple slot from outer node */
+ ExecDropSingleTupleTableSlot(node->sampleSlot);
+
+ /*
+ * Release tuplesort resources
+ */
+ if (node->tuplesortstate != NULL)
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+
+ /*
+ * shut down the subplan
+ */
+ ExecEndNode(outerPlanState(node));
+
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "sort node shutdown");
+ }
+
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ PlanState *outerPlan = outerPlanState(node);
+
+ /*
+ * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ * re-scan it at all.
+ */
+ if (!node->sort_Done)
+ return;
+
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+ /*
+ * Incremental sort doesn't support randomAccess, so we always forget
+ * the previous sort results: we have to re-read the subplan and
+ * re-sort whenever the node is rescanned.
+ */
+ node->sort_Done = false;
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+ node->bound_Done = 0;
+
+ /*
+ * if chgParam of subnode is not null then plan will be re-scanned by
+ * first ExecProcNode.
+ */
+ if (outerPlan->chgParam == NULL)
+ ExecReScan(outerPlan);
+ }
+
+ /* ----------------------------------------------------------------
+ * Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortEstimate
+ *
+ * Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+ size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+ shm_toc_estimate_chunk(&pcxt->estimator, size);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortInitializeDSM
+ *
+ * Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + pcxt->nworkers * sizeof(IncrementalSortInfo);
+ node->shared_info = shm_toc_allocate(pcxt->toc, size);
+ /* ensure any unfilled slots will contain zeroes */
+ memset(node->shared_info, 0, size);
+ node->shared_info->num_workers = pcxt->nworkers;
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+ node->shared_info);
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ /* If there's any instrumentation space, clear it for next time */
+ if (node->shared_info != NULL)
+ {
+ memset(node->shared_info->sinfo, 0,
+ node->shared_info->num_workers * sizeof(IncrementalSortInfo));
+ }
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortInitializeWorker
+ *
+ * Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortInitializeWorker(IncrementalSortState *node, shm_toc *toc)
+ {
+ node->shared_info =
+ shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id, true);
+ node->am_worker = true;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortRetrieveInstrumentation
+ *
+ * Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+ {
+ Size size;
+ SharedIncrementalSortInfo *si;
+
+ if (node->shared_info == NULL)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+ si = palloc(size);
+ memcpy(si, node->shared_info, size);
+ node->shared_info = si;
+ }
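(Aside, not part of the patch: the batching logic in ExecIncrementalSort above can be sketched in a few lines of Python. This is a minimal illustrative model only; MIN_GROUP_SIZE and the tuple layout mirror the patch, but none of this is the executor's actual data structures.)

```python
# Illustrative sketch of ExecIncrementalSort's batching: accumulate at
# least MIN_GROUP_SIZE tuples, then keep adding tuples while their
# skip-column prefix matches the saved sample tuple; sort each batch by
# the full key and emit it. Assumes `rows` is presorted on the first
# `skip_cols` columns, as the executor requires of its outer subtree.

MIN_GROUP_SIZE = 32

def incremental_sort(rows, skip_cols):
    batch = []
    sample = None          # analogue of the node's sampleSlot
    for row in rows:
        if len(batch) < MIN_GROUP_SIZE:
            # Fill the minimal batch unconditionally, remembering the
            # last tuple as the comparison sample.
            batch.append(row)
            sample = row
        elif row[:skip_cols] == sample[:skip_cols]:
            # Same prefix group as the sample: keep extending the batch.
            batch.append(row)
        else:
            # Prefix changed: sort and emit the batch, start a new one.
            yield from sorted(batch)
            batch = [row]
            sample = row
    yield from sorted(batch)
```

Because a batch is only closed at a prefix boundary (after the minimal size is reached), each prefix group lands entirely in one batch, so concatenating the per-batch sorts yields a fully sorted output.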
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 98bcaeb..2bddf63
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(PlanState *pstate)
*** 93,99 ****
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
--- 93,100 ----
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess,
! false);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index 76e7545..a0061a6
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 917,922 ****
--- 917,940 ----
/*
+ * CopySortFields
+ *
+ * This function copies the fields of the Sort node. It is used by
+ * all the copy functions for classes which inherit from Sort.
+ */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+ COPY_SCALAR_FIELD(numCols);
+ COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+
+ /*
* _copySort
*/
static Sort *
*************** _copySort(const Sort *from)
*** 927,939 ****
/*
* copy node superclass fields
*/
! CopyPlanFields((const Plan *) from, (Plan *) newnode);
! COPY_SCALAR_FIELD(numCols);
! COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
return newnode;
}
--- 945,973 ----
/*
* copy node superclass fields
*/
! CopySortFields(from, newnode);
! return newnode;
! }
!
!
! /*
! * _copyIncrementalSort
! */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! IncrementalSort *newnode = makeNode(IncrementalSort);
!
! /*
! * copy node superclass fields
! */
! CopySortFields((const Sort *) from, (Sort *) newnode);
!
! /*
! * copy remainder of node
! */
! COPY_SCALAR_FIELD(skipCols);
return newnode;
}
*************** copyObjectImpl(const void *from)
*** 4801,4806 ****
--- 4835,4843 ----
case T_Sort:
retval = _copySort(from);
break;
+ case T_IncrementalSort:
+ retval = _copyIncrementalSort(from);
+ break;
case T_Group:
retval = _copyGroup(from);
break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index dc35df9..a6709c9
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 866,877 ****
}
static void
! _outSort(StringInfo str, const Sort *node)
{
int i;
- WRITE_NODE_TYPE("SORT");
-
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
--- 866,875 ----
}
static void
! _outSortInfo(StringInfo str, const Sort *node)
{
int i;
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 894,899 ****
--- 892,915 ----
}
static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ WRITE_NODE_TYPE("SORT");
+
+ _outSortInfo(str, node);
+ }
+
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ WRITE_NODE_TYPE("INCREMENTALSORT");
+
+ _outSortInfo(str, (const Sort *) node);
+
+ WRITE_INT_FIELD(skipCols);
+ }
+
+ static void
_outUnique(StringInfo str, const Unique *node)
{
int i;
*************** outNode(StringInfo str, const void *obj)
*** 3734,3739 ****
--- 3750,3758 ----
case T_Sort:
_outSort(str, obj);
break;
+ case T_IncrementalSort:
+ _outIncrementalSort(str, obj);
+ break;
case T_Unique:
_outUnique(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index 593658d..9e8476a
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 2059,2070 ****
}
/*
! * _readSort
*/
! static Sort *
! _readSort(void)
{
! READ_LOCALS(Sort);
ReadCommonPlan(&local_node->plan);
--- 2059,2071 ----
}
/*
! * ReadCommonSort
! * Assign the basic stuff of all nodes that inherit from Sort
*/
! static void
! ReadCommonSort(Sort *local_node)
{
! READ_TEMP_LOCALS();
ReadCommonPlan(&local_node->plan);
*************** _readSort(void)
*** 2073,2078 ****
--- 2074,2105 ----
READ_OID_ARRAY(sortOperators, local_node->numCols);
READ_OID_ARRAY(collations, local_node->numCols);
READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+
+ /*
+ * _readSort
+ */
+ static Sort *
+ _readSort(void)
+ {
+ READ_LOCALS_NO_FIELDS(Sort);
+
+ ReadCommonSort(local_node);
+
+ READ_DONE();
+ }
+
+ /*
+ * _readIncrementalSort
+ */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ READ_LOCALS(IncrementalSort);
+
+ ReadCommonSort(&local_node->sort);
+
+ READ_INT_FIELD(skipCols);
READ_DONE();
}
*************** parseNodeString(void)
*** 2632,2637 ****
--- 2659,2666 ----
return_value = _readMaterial();
else if (MATCH("SORT", 4))
return_value = _readSort();
+ else if (MATCH("INCREMENTALSORT", 15))
+ return_value = _readIncrementalSort();
else if (MATCH("GROUP", 5))
return_value = _readGroup();
else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index 906d08a..28f2b74
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3459,3464 ****
--- 3459,3468 ----
ptype = "Sort";
subpath = ((SortPath *) path)->subpath;
break;
+ case T_IncrementalSortPath:
+ ptype = "IncrementalSort";
+ subpath = ((SortPath *) path)->subpath;
+ break;
case T_GroupPath:
ptype = "Group";
subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 2d2df60..e56b3a2
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
+ bool enable_incrementalsort = true;
bool enable_hashagg = true;
bool enable_nestloop = true;
bool enable_material = true;
*************** cost_recursive_union(Path *runion, Path
*** 1601,1606 ****
--- 1602,1614 ----
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
+ * The sort can be either a full sort of the relation or an incremental sort
+ * when we already have data presorted by some of the required pathkeys. In
+ * the latter case we estimate the number of groups the source data is
+ * divided into by the presorted pathkeys, and then estimate the cost of
+ * sorting each individual group, assuming the data is divided into groups
+ * uniformly. Also, if a LIMIT is specified, then we have to pull from the
+ * source and sort only some of the total groups.
+ *
* If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path
*** 1627,1633 ****
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
--- 1635,1642 ----
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_startup_cost' is the startup cost for reading the input data
! * 'input_total_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path
*** 1643,1661 ****
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_cost;
! Cost run_cost = 0;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
path->rows = tuples;
--- 1652,1679 ----
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_startup_cost;
! Cost run_cost = 0,
! rest_cost,
! group_cost,
! input_run_cost = input_total_cost - input_startup_cost;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
+ double num_groups,
+ group_input_bytes,
+ group_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
+ if (!enable_incrementalsort)
+ presorted_keys = 0;
path->rows = tuples;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1681,1693 ****
output_bytes = input_bytes;
}
! if (output_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(input_bytes / BLCKSZ);
! double nruns = input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
--- 1699,1748 ----
output_bytes = input_bytes;
}
! /*
! * Estimate the number of groups into which the presorted keys divide
! * the dataset.
! */
! if (presorted_keys > 0)
! {
! List *presortedExprs = NIL;
! ListCell *l;
! int i = 0;
!
! /* Extract presorted keys as list of expressions */
! foreach(l, pathkeys)
! {
! PathKey *key = (PathKey *)lfirst(l);
! EquivalenceMember *member = (EquivalenceMember *)
! lfirst(list_head(key->pk_eclass->ec_members));
!
! presortedExprs = lappend(presortedExprs, member->em_expr);
!
! i++;
! if (i >= presorted_keys)
! break;
! }
!
! /* Estimate number of groups with equal presorted keys */
! num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! }
! else
! {
! num_groups = 1.0;
! }
!
! /*
! * Estimate the average cost of sorting one group in which all the
! * presorted keys are equal.
! */
! group_input_bytes = input_bytes / num_groups;
! group_tuples = tuples / num_groups;
! if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(group_input_bytes / BLCKSZ);
! double nruns = group_input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1697,1703 ****
*
* Assume about N log2 N comparisons
*/
! startup_cost += comparison_cost * tuples * LOG2(tuples);
/* Disk costs */
--- 1752,1758 ----
*
* Assume about N log2 N comparisons
*/
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
/* Disk costs */
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1708,1717 ****
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! startup_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1763,1772 ----
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! group_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1719,1732 ****
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
}
else
{
! /* We'll use plain quicksort on all the input tuples */
! startup_cost += comparison_cost * tuples * LOG2(tuples);
}
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
--- 1774,1806 ----
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
}
else
{
! /*
! * We'll use plain quicksort on all the input tuples. If we expect
! * fewer than two tuples per sort group, assume the logarithmic part
! * of the estimate to be 1.
! */
! if (group_tuples >= 2.0)
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
! else
! group_cost = comparison_cost * group_tuples;
}
+ /* Add per group cost of fetching tuples from input */
+ group_cost += input_run_cost / num_groups;
+
+ /*
+ * We have to sort the first group before the node can start returning
+ * output. Sorting the remaining groups is required to return all the
+ * other tuples.
+ */
+ startup_cost += group_cost;
+ rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ if (rest_cost > 0.0)
+ run_cost += rest_cost;
+
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1737,1742 ****
--- 1811,1829 ----
*/
run_cost += cpu_operator_cost * tuples;
+ /* Extra costs of incremental sort */
+ if (presorted_keys > 0)
+ {
+ /*
+ * In the incremental sort case we also have to account for the cost of
+ * detecting sort groups. This amounts to an extra copy and comparison
+ * for each tuple.
+ */
+ run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+ /* Cost of per group tuplesort reset */
+ run_cost += 2.0 * cpu_tuple_cost * num_groups;
+ }
+
path->startup_cost = startup_cost;
path->total_cost = startup_cost + run_cost;
}
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2544,2549 ****
--- 2631,2638 ----
cost_sort(&sort_path,
root,
outersortkeys,
+ pathkeys_common(outer_path->pathkeys, outersortkeys),
+ outer_path->startup_cost,
outer_path->total_cost,
outer_path_rows,
outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2570,2575 ****
--- 2659,2666 ----
cost_sort(&sort_path,
root,
innersortkeys,
+ pathkeys_common(inner_path->pathkeys, innersortkeys),
+ inner_path->startup_cost,
inner_path->total_cost,
inner_path_rows,
inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index c6870d3..30a755c
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
#include "nodes/nodeFuncs.h"
#include "nodes/plannodes.h"
#include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/tlist.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
return PATHKEYS_EQUAL;
}
+
+ /*
+ * pathkeys_common
+ * Returns the length of the longest common prefix of keys1 and keys2.
+ */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ int n;
+ ListCell *key1,
+ *key2;
+ n = 0;
+
+ forboth(key1, keys1, key2, keys2)
+ {
+ PathKey *pathkey1 = (PathKey *) lfirst(key1);
+ PathKey *pathkey2 = (PathKey *) lfirst(key2);
+
+ if (pathkey1 != pathkey2)
+ return n;
+ n++;
+ }
+
+ return n;
+ }
+
+
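The behavior of pathkeys_common() can be shown with a toy analogue (illustration only, not planner code): the real function compares PathKey pointers by identity, so plain ints stand in for them here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Length of the longest common prefix of two arrays, mirroring what
 * pathkeys_common() computes over two pathkey lists.
 */
static int
common_prefix_len(const int *a, size_t alen, const int *b, size_t blen)
{
	size_t		n = 0;
	size_t		limit = (alen < blen) ? alen : blen;

	while (n < limit && a[n] == b[n])
		n++;
	return (int) n;
}
```

A result of 0 means a full sort is needed; a result equal to the requested list's length means no sort at all; anything in between is the incremental-sort case.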
/*
* pathkeys_contained_in
* Common special case of compare_pathkeys: we just want to know
*************** get_cheapest_path_for_pathkeys(List *pat
*** 373,380 ****
/*
* get_cheapest_fractional_path_for_pathkeys
* Find the cheapest path (for retrieving a specified fraction of all
! * the tuples) that satisfies the given pathkeys and parameterization.
! * Return NULL if no such path.
*
* See compare_fractional_path_costs() for the interpretation of the fraction
* parameter.
--- 402,413 ----
/*
* get_cheapest_fractional_path_for_pathkeys
* Find the cheapest path (for retrieving a specified fraction of all
! * the tuples) that satisfies the given parameterization and at least
! * partially satisfies the given pathkeys. Return NULL if no such path.
! * If the pathkeys are satisfied only partially, we would have to perform
! * an incremental sort to satisfy them completely. Since incremental sort
! * consumes its input in presorted groups, we would have to read more data
! * than with a fully presorted path.
*
* See compare_fractional_path_costs() for the interpretation of the fraction
* parameter.
*************** right_merge_direction(PlannerInfo *root,
*** 1488,1513 ****
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
! * no good to order by just the first key(s) of the requested ordering.
! * So the result is always either 0 or list_length(root->query_pathkeys).
*/
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
{
! if (root->query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
{
! /* It's useful ... or at least the first N keys are */
! return list_length(root->query_pathkeys);
}
-
- return 0; /* path ordering not useful */
}
/*
--- 1521,1562 ----
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Returns the number of pathkeys that match the given argument. The
! * remaining ones can be satisfied by an incremental sort.
*/
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
{
! int n_common_pathkeys;
!
! if (query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
!
! if (enable_incrementalsort)
{
! /*
! * Return the number of pathkeys in common, or 0 if there are none.
! * Any common prefix of pathkeys is useful for ordering because we can
! * use incremental sort for the remainder.
! */
! return n_common_pathkeys;
! }
! else
! {
! /*
! * When incremental sort is disabled, pathkeys are useful only when they
! * do contain all the query pathkeys.
! */
! if (n_common_pathkeys == list_length(query_pathkeys))
! return n_common_pathkeys;
! else
! return 0;
}
}
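The branching in pathkeys_useful_for_ordering() reduces to a small decision rule, sketched here as a hypothetical helper (names and parameters are illustrative, not from the patch):

```c
#include <assert.h>

/*
 * Given the common-prefix length and the number of query pathkeys,
 * return how many of the path's pathkeys are useful for ordering:
 * with incremental sort enabled any common prefix helps; otherwise
 * only a complete match does.
 */
static int
useful_for_ordering(int n_common, int n_query, int incremental_enabled)
{
	if (n_query == 0 || n_common == 0)
		return 0;
	if (incremental_enabled)
		return n_common;
	return (n_common == n_query) ? n_common : 0;
}
```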
/*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1523,1529 ****
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
--- 1572,1578 ----
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 9c74e39..71b2b4a
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 235,241 ****
Plan *lefttree, Plan *righttree,
JoinType jointype, bool inner_unique,
bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 235,241 ----
Plan *lefttree, Plan *righttree,
JoinType jointype, bool inner_unique,
bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static EquivalenceMember *find_ec_member
*** 251,260 ****
TargetEntry *tle,
Relids relids);
static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! Relids relids);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 251,261 ----
TargetEntry *tle,
Relids relids);
static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! Relids relids, int skipCols);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 436,441 ****
--- 437,443 ----
(GatherPath *) best_path);
break;
case T_Sort:
+ case T_IncrementalSort:
plan = (Plan *) create_sort_plan(root,
(SortPath *) best_path,
flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1120,1125 ****
--- 1122,1128 ----
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int n_common_pathkeys;
/* Build the child plan */
/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1154,1162 ****
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
--- 1157,1167 ----
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
+ n_common_pathkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1506,1511 ****
--- 1511,1517 ----
Plan *subplan;
List *pathkeys = best_path->path.pathkeys;
List *tlist = build_path_tlist(root, &best_path->path);
+ int n_common_pathkeys;
/* As with Gather, it's best to project away columns in the workers. */
subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1535,1546 ****
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
--- 1541,1556 ----
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
! {
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ n_common_pathkeys,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
+ }
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1653,1658 ****
--- 1663,1669 ----
{
Sort *plan;
Plan *subplan;
+ int n_common_pathkeys;
/*
* We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1662,1668 ****
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
copy_generic_path_info(&plan->plan, (Path *) best_path);
--- 1673,1683 ----
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! best_path->subpath->pathkeys);
!
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! NULL, n_common_pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1906,1912 ****
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
! subplan);
}
if (!rollup->is_hashed)
--- 1921,1928 ----
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
! subplan,
! 0);
}
if (!rollup->is_hashed)
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3846,3855 ****
*/
if (best_path->outersortkeys)
{
Relids outer_relids = outer_path->parent->relids;
! Sort *sort = make_sort_from_pathkeys(outer_plan,
! best_path->outersortkeys,
! outer_relids);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
--- 3862,3876 ----
*/
if (best_path->outersortkeys)
{
+ Sort *sort;
+ int n_common_pathkeys;
Relids outer_relids = outer_path->parent->relids;
!
! n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! best_path->jpath.outerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! outer_relids, n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3860,3869 ****
if (best_path->innersortkeys)
{
Relids inner_relids = inner_path->parent->relids;
! Sort *sort = make_sort_from_pathkeys(inner_plan,
! best_path->innersortkeys,
! inner_relids);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
--- 3881,3895 ----
if (best_path->innersortkeys)
{
+ Sort *sort;
+ int n_common_pathkeys;
Relids inner_relids = inner_path->parent->relids;
!
! n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! best_path->jpath.innerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! inner_relids, n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4915,4921 ****
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
--- 4941,4948 ----
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL, 0,
! lefttree->startup_cost,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5504,5516 ****
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node = makeNode(Sort);
! Plan *plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
--- 5531,5561 ----
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node;
! Plan *plan;
!
! /* Always use regular sort node when enable_incrementalsort = false */
! if (!enable_incrementalsort)
! skipCols = 0;
!
! if (skipCols == 0)
! {
! node = makeNode(Sort);
! }
! else
! {
! IncrementalSort *incrementalSort;
!
! incrementalSort = makeNode(IncrementalSort);
! node = &incrementalSort->sort;
! incrementalSort->skipCols = skipCols;
! }
+ plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass
*** 5845,5851 ****
* 'relids' is the set of relations required by prepare_sort_from_pathkeys()
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
{
int numsortkeys;
AttrNumber *sortColIdx;
--- 5890,5897 ----
* 'relids' is the set of relations required by prepare_sort_from_pathkeys()
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! Relids relids, int skipCols)
{
int numsortkeys;
AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree,
*** 5865,5871 ****
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5911,5917 ----
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5908,5914 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5954,5960 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, 0,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5929,5935 ****
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
--- 5975,5982 ----
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5962,5968 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 6009,6015 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** is_projection_capable_plan(Plan *plan)
*** 6618,6623 ****
--- 6665,6671 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index 889e8af..49af1f1
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
#include "parser/parse_clause.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
#include "utils/syscache.h"
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index 90fd9cc..ce2acac
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3843,3856 ****
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_partial_path || is_sorted)
{
/* Sort the cheapest partial path, if it isn't already */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3843,3856 ----
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_partial_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest partial path, if it isn't already */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3923,3936 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_path || is_sorted)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3923,3936 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4997,5009 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || is_sorted)
{
! if (!is_sorted)
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
--- 4997,5009 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || n_useful_pathkeys > 0)
{
! if (n_useful_pathkeys < list_length(root->sort_pathkeys))
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid
*** 6133,6140 ****
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL,
! seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
--- 6133,6141 ----
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL, 0,
! seqScanPath->startup_cost, seqScanPath->total_cost,
! rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index fa9a3f0..407568a
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 638,643 ****
--- 638,644 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 2e3abee..0ee6812
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2782,2787 ****
--- 2782,2788 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index f620243..c83161f
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 988,994 ****
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 988,995 ----
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, 0,
! sorted_p.startup_cost, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 68dee0f..1c2b815
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 103,109 ****
}
/*
! * compare_path_fractional_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
--- 103,109 ----
}
/*
! * compare_fractional_path_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1304,1315 ****
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1304,1316 ----
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ int n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1323,1328 ****
--- 1324,1331 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->parent->tuples,
subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1564,1570 ****
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
--- 1567,1574 ----
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL, 0,
! subpath->startup_cost,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
*************** create_gather_merge_path(PlannerInfo *ro
*** 1657,1662 ****
--- 1661,1667 ----
GatherMergePath *pathnode = makeNode(GatherMergePath);
Cost input_startup_cost = 0;
Cost input_total_cost = 0;
+ int n_common_pathkeys;
Assert(subpath->parallel_safe);
Assert(pathkeys);
*************** create_gather_merge_path(PlannerInfo *ro
*** 1673,1679 ****
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1678,1686 ----
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
!
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_gather_merge_path(PlannerInfo *ro
*** 1687,1692 ****
--- 1694,1701 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2543,2551 ****
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode = makeNode(SortPath);
- pathnode->path.pathtype = T_Sort;
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
--- 2552,2582 ----
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode;
! int n_common_pathkeys;
!
! if (enable_incrementalsort)
! n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! else
! n_common_pathkeys = 0;
!
! if (n_common_pathkeys == 0)
! {
! pathnode = makeNode(SortPath);
! pathnode->path.pathtype = T_Sort;
! }
! else
! {
! IncrementalSortPath *incpathnode;
!
! incpathnode = makeNode(IncrementalSortPath);
! pathnode = &incpathnode->spath;
! pathnode->path.pathtype = T_IncrementalSort;
! incpathnode->skipCols = n_common_pathkeys;
! }
!
! Assert(n_common_pathkeys < list_length(pathkeys));
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2559,2565 ****
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root, pathkeys,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
--- 2590,2598 ----
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root,
! pathkeys, n_common_pathkeys,
! subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2871,2877 ****
else
{
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL,
0.0,
subpath->rows,
subpath->pathtarget->width,
--- 2904,2911 ----
else
{
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL, 0,
! 0.0,
0.0,
subpath->rows,
subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index 1e323d9..8f01f05
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 291,297 ****
qstate->sortCollations,
qstate->sortNullsFirsts,
work_mem,
! qstate->rescan_needed);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
--- 291,298 ----
qstate->sortCollations,
qstate->sortNullsFirsts,
work_mem,
! qstate->rescan_needed,
! false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index 4bbb4a8..d9c3243
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3650,3655 ****
--- 3650,3691 ----
}
/*
+ * estimate_pathkeys_groups - Estimate the number of groups the dataset is
+ * divided into by pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * the first i pathkeys divide the dataset into. This is a convenience
+ * wrapper over estimate_num_groups().
+ */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ ListCell *l;
+ List *groupExprs = NIL;
+ double *result;
+ int i;
+
+ /*
+ * Get number of groups for each prefix of pathkeys.
+ */
+ i = 0;
+ result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ groupExprs = lappend(groupExprs, member->em_expr);
+
+ result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ i++;
+ }
+
+ return result;
+ }
+
+ /*
* Estimate hash bucket statistics when the specified expression is used
* as a hash key for the given number of buckets.
*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index c4c1afa..d9195ef
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 858,863 ****
--- 858,872 ----
NULL, NULL, NULL
},
{
+ {"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of incremental sort steps."),
+ NULL
+ },
+ &enable_incrementalsort,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of hashed aggregation plans."),
NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 34af8d6..a92b477
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 231,236 ****
--- 231,243 ----
int64 allowedMem; /* total memory allowed, in bytes */
int maxTapes; /* number of tapes (Knuth's T) */
int tapeRange; /* maxTapes-1 (Knuth's P) */
+ int64 maxSpace; /* maximum amount of space occupied among sorts
of groups, either in-memory or on-disk */
+ bool maxSpaceOnDisk; /* true when maxSpace is the value for on-disk
space, false when it's the value for in-memory
space */
+ TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ MemoryContext maincontext;
MemoryContext sortcontext; /* memory context holding most sort data */
MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 573,578 ****
--- 580,588 ----
static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+
/*
* Special versions of qsort just for SortTuple objects. qsort_tuple() sorts
*************** static Tuplesortstate *
*** 607,625 ****
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Create a working memory context for this sort operation. All data
! * needed by the sort will live inside this context.
*/
! sortcontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
--- 617,646 ----
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
+ MemoryContext maincontext;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Memory context surviving tuplesort_reset. This memory context holds
! * data which is useful to keep while sorting multiple similar batches.
*/
! maincontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
+ * Create a working memory context for one sort operation. The content of
+ * this context is deleted by tuplesort_reset.
+ */
+ sortcontext = AllocSetContextCreate(maincontext,
+ "TupleSort sort",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+
+ /*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 636,642 ****
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(sortcontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
--- 657,663 ----
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(maincontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
*************** tuplesort_begin_common(int workMem, bool
*** 654,659 ****
--- 675,681 ----
state->availMem = state->allowedMem;
state->sortcontext = sortcontext;
state->tuplecontext = tuplecontext;
+ state->maincontext = maincontext;
state->tapeset = NULL;
state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 694,706 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
AssertArg(nkeys > 0);
--- 716,729 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
AssertArg(nkeys > 0);
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 742,748 ****
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0);
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
--- 765,771 ----
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0) && !skipAbbrev;
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 773,779 ****
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 796,802 ----
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 864,870 ****
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 887,893 ----
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 939,945 ****
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 962,968 ----
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 981,987 ****
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 1004,1010 ----
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1092,1107 ****
}
/*
! * tuplesort_end
! *
! * Release resources and clean up.
*
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
*/
! void
! tuplesort_end(Tuplesortstate *state)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1115,1126 ----
}
/*
! * tuplesort_free
*
! * Internal routine for freeing resources of tuplesort.
*/
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1160,1166 ****
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! MemoryContextDelete(state->sortcontext);
}
/*
--- 1179,1276 ----
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! if (delete)
! {
! MemoryContextDelete(state->maincontext);
! }
! else
! {
! MemoryContextResetOnly(state->sortcontext);
! MemoryContextResetOnly(state->tuplecontext);
! }
! }
!
! /*
! * tuplesort_end
! *
! * Release resources and clean up.
! *
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
! */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! tuplesort_free(state, true);
! }
!
! /*
! * tuplesort_updatemax
! *
! * Update maximum resource usage statistics.
! */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! int64 spaceUsed;
! bool spaceUsedOnDisk;
!
! /*
! * Note: it might seem we should provide both memory and disk usage for a
! * disk-based sort. However, the current code doesn't track memory space
! * accurately once we have begun to return tuples to the caller (since we
! * don't account for pfree's the caller is expected to do), so we cannot
! * rely on availMem in a disk sort. This does not seem worth the overhead
! * to fix. Is it worth creating an API for the memory context code to
! * tell us how much is actually used in sortcontext?
! */
! if (state->tapeset)
! {
! spaceUsedOnDisk = true;
! spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! }
! else
! {
! spaceUsedOnDisk = false;
! spaceUsed = state->allowedMem - state->availMem;
! }
!
! if (spaceUsed > state->maxSpace)
! {
! state->maxSpace = spaceUsed;
! state->maxSpaceOnDisk = spaceUsedOnDisk;
! state->maxSpaceStatus = state->status;
! }
! }
!
! /*
! * tuplesort_reset
! *
! * Reset the tuplesort. Reset all the data in the tuplesort, but leave the
! * meta-information in. After tuplesort_reset, the tuplesort is ready to
! * start a new sort. This avoids recreating the tuplesort (and saves
! * resources) when sorting multiple small batches.
! */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! tuplesort_updatemax(state);
! tuplesort_free(state, false);
! state->status = TSS_INITIAL;
! state->memtupcount = 0;
! state->boundUsed = false;
! state->tapeset = NULL;
! state->currentRun = 0;
! state->result_tape = -1;
! state->bounded = false;
! state->availMem = state->allowedMem;
! state->lastReturnedTuple = NULL;
! state->slabAllocatorUsed = false;
! state->slabMemoryBegin = NULL;
! state->slabMemoryEnd = NULL;
! state->slabFreeHead = NULL;
! USEMEM(state, GetMemoryChunkSpace(state->memtuples));
}
/*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 2950,2967 ****
* to fix. Is it worth creating an API for the memory context code to
* tell us how much is actually used in sortcontext?
*/
! if (state->tapeset)
! {
stats->spaceType = SORT_SPACE_TYPE_DISK;
- stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- }
else
- {
stats->spaceType = SORT_SPACE_TYPE_MEMORY;
! stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! }
! switch (state->status)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
--- 3060,3074 ----
* to fix. Is it worth creating an API for the memory context code to
* tell us how much is actually used in sortcontext?
*/
! tuplesort_updatemax(state);
!
! if (state->maxSpaceOnDisk)
stats->spaceType = SORT_SPACE_TYPE_DISK;
else
stats->spaceType = SORT_SPACE_TYPE_MEMORY;
! stats->spaceUsed = (state->maxSpace + 1023) / 1024;
! switch (state->maxSpaceStatus)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...cfe944f
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,31 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+
+ #include "access/parallel.h"
+ #include "nodes/execnodes.h"
+
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+ /* parallel instrumentation support */
+ extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, shm_toc *toc);
+ extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+ #endif /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index e05bc04..ff019c5
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1743,1748 ****
--- 1743,1762 ----
Tuplestorestate *tuplestorestate;
} MaterialState;
+
+ /* ----------------
+ * When sorting by multiple keys, the input dataset may already be
+ * presorted by some prefix of those keys. We call these "skip keys".
+ * SkipKeyData represents information about one such key.
+ * ----------------
+ */
+ typedef struct SkipKeyData
+ {
+ FmgrInfo flinfo; /* comparison function info */
+ FunctionCallInfoData fcinfo; /* comparison function call info */
+ OffsetNumber attno; /* attribute number in tuple */
+ } SkipKeyData;
+
/* ----------------
* Shared memory container for per-worker sort information
* ----------------
*************** typedef struct SortState
*** 1771,1776 ****
--- 1785,1828 ----
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
+ /* ----------------
+ * Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+ typedef struct IncrementalSortInfo
+ {
+ TuplesortInstrumentation sinstrument;
+ int64 groupsCount;
+ } IncrementalSortInfo;
+
+ typedef struct SharedIncrementalSortInfo
+ {
+ int num_workers;
+ IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+ } SharedIncrementalSortInfo;
+
+ /* ----------------
+ * IncrementalSortState information
+ * ----------------
+ */
+ typedef struct IncrementalSortState
+ {
+ ScanState ss; /* its first field is NodeTag */
+ bool bounded; /* is the result set bounded? */
+ int64 bound; /* if bounded, how many tuples are needed */
+ bool sort_Done; /* sort completed yet? */
+ bool finished; /* have we finished fetching tuples
from the outer node? */
+ bool bounded_Done; /* value of bounded we did the sort with */
+ int64 bound_Done; /* value of bound we did the sort with */
+ void *tuplesortstate; /* private state of tuplesort.c */
+ SkipKeyData *skipKeys; /* keys the dataset is presorted by */
+ int64 groupsCount; /* number of groups with equal skip keys */
+ TupleTableSlot *sampleSlot; /* slot for sample tuple of sort group */
+ bool am_worker; /* are we a worker? */
+ SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+ } IncrementalSortState;
+
/* ---------------------
* GroupState information
* ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index ffeeb49..4b78045
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 73,78 ****
--- 73,79 ----
T_HashJoin,
T_Material,
T_Sort,
+ T_IncrementalSort,
T_Group,
T_Agg,
T_WindowAgg,
*************** typedef enum NodeTag
*** 125,130 ****
--- 126,132 ----
T_HashJoinState,
T_MaterialState,
T_SortState,
+ T_IncrementalSortState,
T_GroupState,
T_AggState,
T_WindowAggState,
*************** typedef enum NodeTag
*** 240,245 ****
--- 242,248 ----
T_ProjectionPath,
T_ProjectSetPath,
T_SortPath,
+ T_IncrementalSortPath,
T_GroupPath,
T_UpperUniquePath,
T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index a127682..2e3e5f2
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 749,754 ****
--- 749,765 ----
bool *nullsFirst; /* NULLS FIRST/LAST directions */
} Sort;
+
+ /* ----------------
+ * incremental sort node
+ * ----------------
+ */
+ typedef struct IncrementalSort
+ {
+ Sort sort;
+ int skipCols; /* number of presorted columns */
+ } IncrementalSort;
+
/* ---------------
* group node -
* Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index 9e68e65..f0a37e5
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1507,1512 ****
--- 1507,1522 ----
} SortPath;
/*
+ * IncrementalSortPath
+ */
+ typedef struct IncrementalSortPath
+ {
+ SortPath spath;
+ int skipCols;
+ } IncrementalSortPath;
+
+
+ /*
* GroupPath represents grouping (of presorted input)
*
* groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 6c2317d..138d951
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
extern bool enable_bitmapscan;
extern bool enable_tidscan;
extern bool enable_sort;
+ extern bool enable_incrementalsort;
extern bool enable_hashagg;
extern bool enable_nestloop;
extern bool enable_material;
*************** extern void cost_namedtuplestorescan(Pat
*** 103,110 ****
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
--- 104,112 ----
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index ea886b6..b4370e2
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 188,193 ****
--- 188,194 ----
extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
Relids required_outer,
CostSelector cost_criterion,
*************** extern List *select_outer_pathkeys_for_m
*** 226,231 ****
--- 227,233 ----
extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
List *mergeclauses,
List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
extern List *truncate_useless_pathkeys(PlannerInfo *root,
RelOptInfo *rel,
List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 199a631..41b7196
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 206,211 ****
--- 206,214 ----
extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows, List **pgset);
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ double tuples);
+
extern void estimate_hash_bucket_stats(PlannerInfo *root,
Node *hashkey, double nbuckets,
Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index b6b8c8e..938d329
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 90,96 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
--- 90,97 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 134,139 ****
--- 135,142 ----
extern void tuplesort_end(Tuplesortstate *state);
+ extern void tuplesort_reset(Tuplesortstate *state);
+
extern void tuplesort_get_stats(Tuplesortstate *state,
TuplesortInstrumentation *stats);
extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort
*** 19,27 ****
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Sort
Sort Key: id, data
! -> Seq Scan on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
--- 19,28 ----
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Incremental Sort
Sort Key: id, data
! Presorted Key: id
! -> Index Scan using test_dc_pkey on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index c698faf..fec6a4e
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE: drop cascades to table matest1
*** 1515,1520 ****
--- 1515,1521 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1655,1663 ****
--- 1656,1700 ----
{3,7,8,10,13,13,16,18,19,22}
(3 rows)
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ QUERY PLAN
+ -------------------------------------------------------------------------
+ Merge Append
+ Sort Key: tenk1.thousand, tenk1.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1
+ -> Incremental Sort
+ Sort Key: tenk1_1.thousand, tenk1_1.thousand
+ Presorted Key: tenk1_1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ QUERY PLAN
+ -------------------------------------------------------------
+ Merge Append
+ Sort Key: a.thousand, a.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 a
+ -> Incremental Sort
+ Sort Key: b.unique2, b.unique2
+ Presorted Key: b.unique2
+ -> Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index cd1f7f3..5acfbbb
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select name, setting from pg_settings wh
*** 76,81 ****
--- 76,82 ----
enable_gathermerge | on
enable_hashagg | on
enable_hashjoin | on
+ enable_incrementalsort | on
enable_indexonlyscan | on
enable_indexscan | on
enable_material | on
*************** select name, setting from pg_settings wh
*** 85,91 ****
enable_seqscan | on
enable_sort | on
enable_tidscan | on
! (13 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
--- 86,92 ----
enable_seqscan | on
enable_sort | on
enable_tidscan | on
! (14 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index 169d0dc..558246b
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 544,549 ****
--- 544,550 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
*************** SELECT
*** 605,613 ****
--- 606,631 ----
ORDER BY f.i LIMIT 10)
FROM generate_series(1, 3) g(i);
+ set enable_incrementalsort = on;
+
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
Antonin Houska <ah@cybertec.at> wrote:
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Patch rebased to current master is attached. I'm going to improve my testing script and post new results.
I wanted to review this patch but incremental-sort-8.patch fails to apply. Can
you please rebase it again?
I could find the matching HEAD quite easily (9b6cb46), so the following are my comments:
* cost_sort()
** "presorted_keys" missing in the prologue
** when called from label_sort_with_costsize(), 0 is passed for
"presorted_keys". However label_sort_with_costsize() can sometimes be
called on an IncrementalSort, in which case there are some "presorted
keys". See create_mergejoin_plan() for example. (IIUC this should only
make EXPLAIN inaccurate, but should not cause incorrect decisions.)
** instead of
if (!enable_incrementalsort)
presorted_keys = false;
you probably meant
if (!enable_incrementalsort)
presorted_keys = 0;
** instead of
/* Extract presorted keys as list of expressions */
foreach(l, pathkeys)
{
PathKey *key = (PathKey *)lfirst(l);
EquivalenceMember *member = (EquivalenceMember *)
lfirst(list_head(key->pk_eclass->ec_members));
you can use linitial():
/* Extract presorted keys as list of expressions */
foreach(l, pathkeys)
{
PathKey *key = (PathKey *)lfirst(l);
EquivalenceMember *member = (EquivalenceMember *)
linitial(key->pk_eclass->ec_members);
* get_cheapest_fractional_path_for_pathkeys()
The prologue says "... at least partially satisfies the given pathkeys ..."
but I see no change in the function code. In particular the use of
pathkeys_contained_in() does not allow for any kind of partial sorting.
* pathkeys_useful_for_ordering()
Extra whitespace following the comment opening string "/*":
/*
* When incremental sort is disabled, pathkeys are useful only when they
* make_sort_from_pathkeys() - the "skipCols" argument should be mentioned in
the prologue.
* create_sort_plan()
I noticed that pathkeys_common() is called, but the value of n_common_pathkeys
should already be in the path's "skipCols" field if the underlying path is
actually IncrementalSortPath.
* create_unique_plan() does not seem to make use of the incremental
sort. Shouldn't it do?
* nodeIncrementalSort.c
** These comments seem to contain typos:
"Incremental sort algorithm would sort by xfollowing groups, which have ..."
"Interate while skip cols are same as in saved tuple"
** (This is rather a pedantic comment) I think prepareSkipCols() should be
defined in front of cmpSortSkipCols().
** the MIN_GROUP_SIZE constant deserves a comment.
* ExecIncrementalSort()
** if (node->tuplesortstate == NULL)
If both branches contain the expression
node->groupsCount++;
I suggest it to be moved outside the "if" construct.
--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
On Wed, Nov 15, 2017 at 7:42 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
Sure, please find rebased patch attached.
+ /*
+ * Check if first "skipCols" sort values are equal.
+ */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ TupleTableSlot *b)
+ {
+ int n, i;
+
+ Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+ n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+ for (i = 0; i < n; i++)
+ {
+ Datum datumA, datumB, result;
+ bool isnullA, isnullB;
+ AttrNumber attno = node->skipKeys[i].attno;
+ SkipKeyData *key;
+
+ datumA = slot_getattr(a, attno, &isnullA);
+ datumB = slot_getattr(b, attno, &isnullB);
+
+ /* Special case for NULL-vs-NULL, else use standard comparison */
+ if (isnullA || isnullB)
+ {
+ if (isnullA == isnullB)
+ continue;
+ else
+ return false;
+ }
+
+ key = &node->skipKeys[i];
+
+ key->fcinfo.arg[0] = datumA;
+ key->fcinfo.arg[1] = datumB;
+
+ /* just for paranoia's sake, we reset isnull each time */
+ key->fcinfo.isnull = false;
+
+ result = FunctionCallInvoke(&key->fcinfo);
+
+ /* Check for null result, since caller is clearly not expecting one */
+ if (key->fcinfo.isnull)
+ elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+ if (!DatumGetBool(result))
+ return false;
+ }
+ return true;
+ }
Is there some reason not to use ApplySortComparator for this? I think
you're missing out on lower-overhead comparators, and in any case it's
probably good code reuse, no?
Embarrassingly, I was unaware of this patch and started prototyping
exactly the same thing independently [1]. I hadn't got very far and
will now abandon that, but that's one thing I did differently. Two
other things that may be different: I had a special case for groups of
size 1 that skipped the sorting, and I only sorted on the suffix
because I didn't put tuples with different prefixes into the sorter (I
was assuming that tuplesort_reset was going to be super efficient,
though I hadn't got around to writing that). I gather that you have
determined empirically that it's better to be able to sort groups of
at least MIN_GROUP_SIZE than to be able to skip the comparisons on the
leading attributes, but why is that the case?
[1]: https://github.com/macdice/postgres/commit/ab0f8aff9c4b25d5598aa6b3c630df864302f572
--
Thomas Munro
http://www.enterprisedb.com
Hi!
Thank you very much for the review. I really appreciate that this topic
is getting attention. Please find the next revision of the patch attached.
On Wed, Nov 15, 2017 at 7:20 PM, Antonin Houska <ah@cybertec.at> wrote:
Antonin Houska <ah@cybertec.at> wrote:
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Patch rebased to current master is attached. I'm going to improve my
testing script and post new results.
I wanted to review this patch but incremental-sort-8.patch fails to apply.
Can you please rebase it again?
I could find the matching HEAD quite easily (9b6cb46), so the following are my
comments:
* cost_sort()
** "presorted_keys" missing in the prologue
Comment is added.
** when called from label_sort_with_costsize(), 0 is passed for
"presorted_keys". However label_sort_with_costsize() can sometimes be
called on an IncrementalSort, in which case there are some "presorted
keys". See create_mergejoin_plan() for example. (IIUC this should only
make EXPLAIN inaccurate, but should not cause incorrect decisions.)
Good catch. Fixed.
** instead of
if (!enable_incrementalsort)
presorted_keys = false;
you probably meant
if (!enable_incrementalsort)
presorted_keys = 0;
Absolutely correct. Fixed.
** instead of
/* Extract presorted keys as list of expressions */
foreach(l, pathkeys)
{
PathKey *key = (PathKey *)lfirst(l);
EquivalenceMember *member = (EquivalenceMember *)
lfirst(list_head(key->pk_eclass->ec_members));
you can use linitial():
/* Extract presorted keys as list of expressions */
foreach(l, pathkeys)
{
PathKey *key = (PathKey *)lfirst(l);
EquivalenceMember *member = (EquivalenceMember *)
linitial(key->pk_eclass->ec_members);
Sure. Fixed.
* get_cheapest_fractional_path_for_pathkeys()
The prologue says "... at least partially satisfies the given pathkeys ..."
but I see no change in the function code. In particular the use of
pathkeys_contained_in() does not allow for any kind of partial sorting.
Good catch. This is part of an optimization for build_minmax_path() which
existed in an earlier version of the patch. That optimization contained a
set of arguable solutions, so I decided to remove it from the patch and
let it wait until the initial implementation is committed.
* pathkeys_useful_for_ordering()
Extra whitespace following the comment opening string "/*":
/*
* When incremental sort is disabled, pathkeys are useful only when they
Fixed.
* make_sort_from_pathkeys() - the "skipCols" argument should be mentioned in
the prologue.
Comment is added.
* create_sort_plan()
I noticed that pathkeys_common() is called, but the value of n_common_pathkeys
should already be in the path's "skipCols" field if the underlying path is
actually IncrementalSortPath.
Sounds like reasonable optimization. Done.
* create_unique_plan() does not seem to make use of the incremental
sort. Shouldn't it do?
It definitely should. But a proper solution doesn't seem easy to me. We
would have to construct the possibly useful paths beforehand, and do so in
a manner agnostic to the order of pathkeys. I'm afraid of possible
regressions in query planning. Therefore, it seems like a topic for a
separate discussion. I would prefer to commit some basic implementation
first and then consider smaller patches with possible enhancements,
including this one.
* nodeIncrementalSort.c
** These comments seem to contain typos:
"Incremental sort algorithm would sort by xfollowing groups, which have
...""Interate while skip cols are same as in saved tuple"
Fixed.
** (This is rather a pedantic comment) I think prepareSkipCols() should be
defined in front of cmpSortSkipCols().
That's a good comment. We're trying to be as pedantic about the code as we
can :)
Fixed.
** the MIN_GROUP_SIZE constant deserves a comment.
Sure. Explanation was added.
* ExecIncrementalSort()
** if (node->tuplesortstate == NULL)
If both branches contain the expression
node->groupsCount++;
I suggest it to be moved outside the "if" construct.
Done.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
incremental-sort-10.patchapplication/octet-stream; name=incremental-sort-10.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index 4339bbf..df72ab1
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1981,2019 ****
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ---------------------------------------------------------------------
Limit
! Output: t1.c1, t2.c1
-> Sort
! Output: t1.c1, t2.c1
! Sort Key: t1.c1, t2.c1
-> Nested Loop
! Output: t1.c1, t2.c1
-> Foreign Scan on public.ft1 t1
! Output: t1.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-> Materialize
! Output: t2.c1
-> Foreign Scan on public.ft2 t2
! Output: t2.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! c1 | c1
! ----+-----
! 1 | 101
! 1 | 102
! 1 | 103
! 1 | 104
! 1 | 105
! 1 | 106
! 1 | 107
! 1 | 108
! 1 | 109
! 1 | 110
(10 rows)
-- different server, not pushed down. No result expected.
--- 1981,2019 ----
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ------------------------------------------------------------------
Limit
! Output: t1.c3, t2.c3
-> Sort
! Output: t1.c3, t2.c3
! Sort Key: t1.c3, t2.c3
-> Nested Loop
! Output: t1.c3, t2.c3
-> Foreign Scan on public.ft1 t1
! Output: t1.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
-> Materialize
! Output: t2.c3
-> Foreign Scan on public.ft2 t2
! Output: t2.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! c3 | c3
! -------+-------
! 00001 | 00101
! 00001 | 00102
! 00001 | 00103
! 00001 | 00104
! 00001 | 00105
! 00001 | 00106
! 00001 | 00107
! 00001 | 00108
! 00001 | 00109
! 00001 | 00110
(10 rows)
-- different server, not pushed down. No result expected.
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index ddfec79..c8c6fb7
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 510,517 ****
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 510,517 ----
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index fc1752f..291360f
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3552,3557 ****
--- 3552,3571 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+ <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of incremental sort
+ steps. The default is <literal>on</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
<term><varname>enable_indexscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 447f69d..a646d82
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual,
*** 80,85 ****
--- 80,87 ----
ExplainState *es);
static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 93,99 ****
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 95,101 ----
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 101,106 ****
--- 103,110 ----
static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 1011,1016 ****
--- 1015,1023 ----
case T_Sort:
pname = sname = "Sort";
break;
+ case T_IncrementalSort:
+ pname = sname = "Incremental Sort";
+ break;
case T_Group:
pname = sname = "Group";
break;
*************** ExplainNode(PlanState *planstate, List *
*** 1611,1616 ****
--- 1618,1629 ----
show_sort_keys(castNode(SortState, planstate), ancestors, es);
show_sort_info(castNode(SortState, planstate), es);
break;
+ case T_IncrementalSort:
+ show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ ancestors, es);
+ show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ es);
+ break;
case T_MergeAppend:
show_merge_append_keys(castNode(MergeAppendState, planstate),
ancestors, es);
*************** static void
*** 1936,1950 ****
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
* Likewise, for a MergeAppend node.
*/
static void
--- 1949,1986 ----
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
+ int skipCols;
+
+ if (IsA(plan, IncrementalSort))
+ skipCols = ((IncrementalSort *) plan)->skipCols;
+ else
+ skipCols = 0;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, skipCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es)
+ {
+ IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+ show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ plan->sort.numCols, plan->skipCols,
+ plan->sort.sortColIdx,
+ plan->sort.sortOperators, plan->sort.collations,
+ plan->sort.nullsFirst,
+ ancestors, es);
+ }
+
+ /*
* Likewise, for a MergeAppend node.
*/
static void
*************** show_merge_append_keys(MergeAppendState
*** 1954,1960 ****
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
--- 1990,1996 ----
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, 0, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1978,1984 ****
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
--- 2014,2020 ----
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
*************** show_grouping_set_keys(PlanState *planst
*** 2047,2053 ****
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
--- 2083,2089 ----
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, 0, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 2104,2110 ****
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
--- 2140,2146 ----
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2117,2129 ****
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
--- 2153,2166 ----
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
+ List *resultPresorted = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2163,2171 ****
--- 2200,2212 ----
nullsFirst[keyno]);
/* Emit one property-list item per sort key */
result = lappend(result, pstrdup(sortkeybuf.data));
+ if (keyno < nPresortedKeys)
+ resultPresorted = lappend(resultPresorted, exprstr);
}
ExplainPropertyList(qlabel, result, es);
+ if (nPresortedKeys > 0)
+ ExplainPropertyList("Presorted Key", resultPresorted, es);
}
/*
*************** show_sort_info(SortState *sortstate, Exp
*** 2374,2379 ****
--- 2415,2509 ----
}
/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es)
+ {
+ if (es->analyze && incrsortstate->sort_Done &&
+ incrsortstate->tuplesortstate != NULL)
+ {
+ Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ TuplesortInstrumentation stats;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+
+ tuplesort_get_stats(state, &stats);
+ sortMethod = tuplesort_method_name(stats.sortMethod);
+ spaceType = tuplesort_space_type_name(stats.spaceType);
+ spaceUsed = stats.spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",
+ sortMethod, spaceType, spaceUsed);
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Groups: %ld\n",
+ incrsortstate->groupsCount);
+ }
+ else
+ {
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups: %ld",
+ incrsortstate->groupsCount, es);
+ }
+ }
+
+ if (incrsortstate->shared_info != NULL)
+ {
+ int n;
+ bool opened_group = false;
+
+ for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+ {
+ TuplesortInstrumentation *sinstrument;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+ int64 groupsCount;
+
+ sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+ groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+ if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+ continue; /* ignore any unfilled slots */
+ sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+ spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+ spaceUsed = sinstrument->spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str,
+ "Worker %d: Sort Method: %s %s: %ldkB Groups: %ld\n",
+ n, sortMethod, spaceType, spaceUsed, groupsCount);
+ }
+ else
+ {
+ if (!opened_group)
+ {
+ ExplainOpenGroup("Workers", "Workers", false, es);
+ opened_group = true;
+ }
+ ExplainOpenGroup("Worker", NULL, true, es);
+ ExplainPropertyInteger("Worker Number", n, es);
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups", groupsCount, es);
+ ExplainCloseGroup("Worker", NULL, true, es);
+ }
+ }
+ if (opened_group)
+ ExplainCloseGroup("Workers", "Workers", false, es);
+ }
+ }
+
+ /*
* Show information on hash buckets/batches.
*/
static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index cc09895..572aca0
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execExpr.
*** 24,31 ****
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
! nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
--- 24,31 ----
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
! nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index f1636a5..dd8cffe
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 31,36 ****
--- 31,37 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 253,258 ****
--- 254,263 ----
ExecReScanSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecReScanIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecReScanGroup((GroupState *) node);
break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 525,532 ****
--- 530,541 ----
case T_CteScan:
case T_Material:
case T_Sort:
+ /* these don't evaluate tlist */
return true;
+ case T_IncrementalSort:
+ return false;
+
case T_LockRows:
case T_Limit:
return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
new file mode 100644
index 53c5254..f3d6876
*** a/src/backend/executor/execParallel.c
--- b/src/backend/executor/execParallel.c
***************
*** 29,34 ****
--- 29,35 ----
#include "executor/nodeBitmapHeapscan.h"
#include "executor/nodeCustom.h"
#include "executor/nodeForeignscan.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeSeqscan.h"
*************** ExecParallelEstimate(PlanState *planstat
*** 263,268 ****
--- 264,273 ----
/* even when not parallel-aware */
ExecSortEstimate((SortState *) planstate, e->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+ break;
default:
break;
*************** ExecParallelInitializeDSM(PlanState *pla
*** 462,467 ****
--- 467,476 ----
/* even when not parallel-aware */
ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+ break;
default:
break;
*************** ExecParallelReInitializeDSM(PlanState *p
*** 876,881 ****
--- 885,894 ----
/* even when not parallel-aware */
ExecSortReInitializeDSM((SortState *) planstate, pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortReInitializeDSM((IncrementalSortState *) planstate, pcxt);
+ break;
default:
break;
*************** ExecParallelRetrieveInstrumentation(Plan
*** 934,939 ****
--- 947,954 ----
*/
if (IsA(planstate, SortState))
ExecSortRetrieveInstrumentation((SortState *) planstate);
+ else if (IsA(planstate, IncrementalSortState))
+ ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
return planstate_tree_walker(planstate, ExecParallelRetrieveInstrumentation,
instrumentation);
*************** ExecParallelInitializeWorker(PlanState *
*** 1164,1169 ****
--- 1179,1189 ----
/* even when not parallel-aware */
ExecSortInitializeWorker((SortState *) planstate, pwcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+ pwcxt);
+ break;
default:
break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index c1aa506..e4225df
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 88,93 ****
--- 88,94 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 314,319 ****
--- 315,325 ----
estate, eflags);
break;
+ case T_IncrementalSort:
+ result = (PlanState *) ExecInitIncrementalSort(
+ (IncrementalSort *) node, estate, eflags);
+ break;
+
case T_Group:
result = (PlanState *) ExecInitGroup((Group *) node,
estate, eflags);
*************** ExecEndNode(PlanState *node)
*** 679,684 ****
--- 685,694 ----
ExecEndSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecEndIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecEndGroup((GroupState *) node);
break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index d26ce08..3c37bda
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 666,671 ****
--- 666,672 ----
sortnode->collations,
sortnode->nullsFirst,
work_mem,
+ false,
false);
}
*************** initialize_aggregate(AggState *aggstate,
*** 753,759 ****
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false);
}
/*
--- 754,760 ----
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false, false);
}
/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...1a1e48f
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,649 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ * Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ * Incremental sort is a specially optimized kind of multikey sort used
+ * when the input is already presorted by a prefix of the required list
+ * of keys. Thus, when it's required to sort by (key1, key2 ... keyN) and
+ * the input is already sorted by (key1, key2 ... keyM), M < N, we sort
+ * individually the groups in which the values of (key1, key2 ... keyM)
+ * are equal.
+ *
+ * Consider the following example. We have input tuples consisting of
+ * two integers (x, y), already presorted by x, while it's required to
+ * sort them by both x and y. Let the input tuples be the following.
+ *
+ * (1, 5)
+ * (1, 2)
+ * (2, 10)
+ * (2, 1)
+ * (2, 5)
+ * (3, 3)
+ * (3, 7)
+ *
+ * The incremental sort algorithm would individually sort by y the
+ * following groups, which have equal x:
+ * (1, 5) (1, 2)
+ * (2, 10) (2, 1) (2, 5)
+ * (3, 3) (3, 7)
+ *
+ * After sorting these groups and putting them back together, we get the
+ * following tuple set, which is sorted by both x and y.
+ *
+ * (1, 2)
+ * (1, 5)
+ * (2, 1)
+ * (2, 5)
+ * (2, 10)
+ * (3, 3)
+ * (3, 7)
+ *
+ * Incremental sort is faster than a full sort on large datasets. But
+ * the greatest benefit of incremental sort comes in queries with LIMIT,
+ * because incremental sort can return the first tuples without reading
+ * the whole input dataset.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+
+ /*
+ * Prepare information for skipKeys comparison.
+ */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ int skipCols,
+ i;
+
+ Assert(IsA(plannode, IncrementalSort));
+ skipCols = plannode->skipCols;
+
+ node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+ for (i = 0; i < skipCols; i++)
+ {
+ Oid equalityOp, equalityFunc;
+ SkipKeyData *key;
+
+ key = &node->skipKeys[i];
+ key->attno = plannode->sort.sortColIdx[i];
+
+ equalityOp = get_equality_op_for_ordering_op(
+ plannode->sort.sortOperators[i], NULL);
+ if (!OidIsValid(equalityOp))
+ elog(ERROR, "missing equality operator for ordering operator %u",
+ plannode->sort.sortOperators[i]);
+
+ equalityFunc = get_opcode(equalityOp);
+ if (!OidIsValid(equalityFunc))
+ elog(ERROR, "missing function for operator %u", equalityOp);
+
+ /* Lookup the comparison function */
+ fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+ /* We can initialize the callinfo just once and re-use it */
+ InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ plannode->sort.collations[i], NULL, NULL);
+ key->fcinfo.argnull[0] = false;
+ key->fcinfo.argnull[1] = false;
+ }
+ }
+
+ /*
+ * Check whether the values of the first "skipCols" sort columns are equal.
+ */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ TupleTableSlot *b)
+ {
+ int n, i;
+
+ Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+ n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+ for (i = 0; i < n; i++)
+ {
+ Datum datumA, datumB, result;
+ bool isnullA, isnullB;
+ AttrNumber attno = node->skipKeys[i].attno;
+ SkipKeyData *key;
+
+ datumA = slot_getattr(a, attno, &isnullA);
+ datumB = slot_getattr(b, attno, &isnullB);
+
+ /* Special case for NULL-vs-NULL, else use standard comparison */
+ if (isnullA || isnullB)
+ {
+ if (isnullA == isnullB)
+ continue;
+ else
+ return false;
+ }
+
+ key = &node->skipKeys[i];
+
+ key->fcinfo.arg[0] = datumA;
+ key->fcinfo.arg[1] = datumB;
+
+ /* just for paranoia's sake, we reset isnull each time */
+ key->fcinfo.isnull = false;
+
+ result = FunctionCallInvoke(&key->fcinfo);
+
+ /* Check for null result, since caller is clearly not expecting one */
+ if (key->fcinfo.isnull)
+ elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+ if (!DatumGetBool(result))
+ return false;
+ }
+ return true;
+ }
+
+ /*
+ * Copying tuples to node->sampleSlot introduces some overhead. It's
+ * especially noticeable when groups contain only one or a few tuples.
+ * To cope with this problem, we don't copy the sample tuple until the
+ * group contains at least MIN_GROUP_SIZE tuples. This might reduce
+ * the efficiency of incremental sort, but it reduces the probability
+ * of a regression.
+ */
+ #define MIN_GROUP_SIZE 32
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSort
+ *
+ * Assuming that the outer subtree returns tuples presorted by some
+ * prefix of the target sort columns, performs an incremental sort. It
+ * fetches groups of tuples where the prefix sort columns are equal and
+ * sorts them using tuplesort. This approach avoids sorting the whole
+ * dataset. Besides taking less memory and being faster, it allows us
+ * to start returning tuples before fetching the full dataset from the
+ * outer subtree.
+ *
+ * Conditions:
+ * -- none.
+ *
+ * Initial States:
+ * -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+ static TupleTableSlot *
+ ExecIncrementalSort(PlanState *pstate)
+ {
+ IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+ EState *estate;
+ ScanDirection dir;
+ Tuplesortstate *tuplesortstate;
+ TupleTableSlot *slot;
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ PlanState *outerNode;
+ TupleDesc tupDesc;
+ int64 nTuples = 0;
+
+ /*
+ * get state info from node
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "entering routine");
+
+ estate = node->ss.ps.state;
+ dir = estate->es_direction;
+ tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+ /*
+ * Return next tuple from sorted set if any.
+ */
+ if (node->sort_Done)
+ {
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ if (tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL) || node->finished)
+ return slot;
+ }
+
+ /*
+ * If first time through, read all tuples from outer plan and pass them to
+ * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
+ */
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "sorting subplan");
+
+ /*
+ * Want to scan subplan in the forward direction while creating the
+ * sorted data.
+ */
+ estate->es_direction = ForwardScanDirection;
+
+ /*
+ * Initialize tuplesort module.
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "calling tuplesort_begin");
+
+ outerNode = outerPlanState(node);
+ tupDesc = ExecGetResultType(outerNode);
+
+ if (node->tuplesortstate == NULL)
+ {
+ /*
+ * We are going to process the first group of presorted data.
+ * Initialize support structures for cmpSortSkipCols - already
+ * sorted columns.
+ */
+ prepareSkipCols(node);
+
+ /*
+ * Pass all the columns to tuplesort. We pass groups of at least
+ * MIN_GROUP_SIZE tuples to tuplesort, so these groups don't
+ * necessarily have equal values of the first column. We're unlikely
+ * to have huge groups with incremental sort, so using abbreviated
+ * keys would likely be a waste of time.
+ */
+ tuplesortstate = tuplesort_begin_heap(
+ tupDesc,
+ plannode->sort.numCols,
+ plannode->sort.sortColIdx,
+ plannode->sort.sortOperators,
+ plannode->sort.collations,
+ plannode->sort.nullsFirst,
+ work_mem,
+ false,
+ true);
+ node->tuplesortstate = (void *) tuplesortstate;
+ }
+ else
+ {
+ /* Next group of presorted data */
+ tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ }
+ node->groupsCount++;
+
+ /* Calculate remaining bound for bounded sort */
+ if (node->bounded)
+ tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+ /* Put the saved tuple, if any, to tuplesort */
+ if (!TupIsNull(node->sampleSlot))
+ {
+ tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+ ExecClearTuple(node->sampleSlot);
+ nTuples++;
+ }
+
+ /*
+ * Put the next group of tuples, where the skipCols sort values are
+ * equal, to tuplesort.
+ */
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+
+ /* Put next group of presorted data to the tuplesort */
+ if (nTuples < MIN_GROUP_SIZE)
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+
+ /* Save last tuple in minimal group */
+ if (nTuples == MIN_GROUP_SIZE - 1)
+ ExecCopySlot(node->sampleSlot, slot);
+ nTuples++;
+ }
+ else
+ {
+ /* Iterate while skip cols are the same as in saved tuple */
+ bool cmp;
+ cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+
+ if (cmp)
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ nTuples++;
+ }
+ else
+ {
+ ExecCopySlot(node->sampleSlot, slot);
+ break;
+ }
+ }
+ }
+
+ /*
+ * Complete the sort.
+ */
+ tuplesort_performsort(tuplesortstate);
+
+ /*
+ * restore to user specified direction
+ */
+ estate->es_direction = dir;
+
+ /*
+ * finally set the sorted flag to true
+ */
+ node->sort_Done = true;
+ node->bounded_Done = node->bounded;
+ if (node->shared_info && node->am_worker)
+ {
+ TuplesortInstrumentation *si;
+
+ Assert(IsParallelWorker());
+ Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+ si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+ tuplesort_get_stats(tuplesortstate, si);
+ node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+ node->groupsCount;
+ }
+
+ /*
+ * Adjust bound_Done with number of tuples we've actually sorted.
+ */
+ if (node->bounded)
+ {
+ if (node->finished)
+ node->bound_Done = node->bound;
+ else
+ node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ }
+
+ SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "retrieving tuple from tuplesort");
+
+ /*
+ * Get the first or next tuple from tuplesort. Returns NULL if no more
+ * tuples.
+ */
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+ return slot;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecInitIncrementalSort
+ *
+ * Creates the run-time state information for the sort node
+ * produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ IncrementalSortState *incrsortstate;
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "initializing sort node");
+
+ /*
+ * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only hold the
+ * current group in the tuplesortstate.
+ */
+ Assert((eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) == 0);
+
+ /*
+ * create state structure
+ */
+ incrsortstate = makeNode(IncrementalSortState);
+ incrsortstate->ss.ps.plan = (Plan *) node;
+ incrsortstate->ss.ps.state = estate;
+ incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+ incrsortstate->bounded = false;
+ incrsortstate->sort_Done = false;
+ incrsortstate->finished = false;
+ incrsortstate->tuplesortstate = NULL;
+ incrsortstate->sampleSlot = NULL;
+ incrsortstate->bound_Done = 0;
+ incrsortstate->groupsCount = 0;
+ incrsortstate->skipKeys = NULL;
+
+ /*
+ * Miscellaneous initialization
+ *
+ * Sort nodes don't initialize their ExprContexts because they never call
+ * ExecQual or ExecProject.
+ */
+
+ /*
+ * tuple table initialization
+ *
+ * sort nodes only return scan tuples from their sorted relation.
+ */
+ ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+
+ /*
+ * initialize child nodes
+ *
+ * We shield the child node from the need to support REWIND, BACKWARD, or
+ * MARK/RESTORE.
+ */
+ eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+ outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+ /*
+ * initialize tuple type. no need to initialize projection info because
+ * this node doesn't do projections.
+ */
+ ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+ /* make standalone slot to store previous tuple from outer node */
+ incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+ ExecGetResultType(outerPlanState(incrsortstate)));
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "sort node initialized");
+
+ return incrsortstate;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "shutting down sort node");
+
+ /*
+ * clean out the tuple table
+ */
+ ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ /* must drop standalone tuple slot from outer node */
+ ExecDropSingleTupleTableSlot(node->sampleSlot);
+
+ /*
+ * Release tuplesort resources
+ */
+ if (node->tuplesortstate != NULL)
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+
+ /*
+ * shut down the subplan
+ */
+ ExecEndNode(outerPlanState(node));
+
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "sort node shutdown");
+ }
+
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ PlanState *outerPlan = outerPlanState(node);
+
+ /*
+ * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ * re-scan it at all.
+ */
+ if (!node->sort_Done)
+ return;
+
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+ /*
+ * If subnode is to be rescanned then we forget previous sort results; we
+ * have to re-read the subplan and re-sort. Also must re-sort if the
+ * bounded-sort parameters changed or we didn't select randomAccess.
+ *
+ * Otherwise we can just rewind and rescan the sorted output.
+ */
+ node->sort_Done = false;
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+ node->bound_Done = 0;
+
+ /*
+ * if chgParam of subnode is not null then plan will be re-scanned by
+ * first ExecProcNode.
+ */
+ if (outerPlan->chgParam == NULL)
+ ExecReScan(outerPlan);
+ }
+
+ /* ----------------------------------------------------------------
+ * Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortEstimate
+ *
+ * Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+ size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+ shm_toc_estimate_chunk(&pcxt->estimator, size);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortInitializeDSM
+ *
+ * Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + pcxt->nworkers * sizeof(IncrementalSortInfo);
+ node->shared_info = shm_toc_allocate(pcxt->toc, size);
+ /* ensure any unfilled slots will contain zeroes */
+ memset(node->shared_info, 0, size);
+ node->shared_info->num_workers = pcxt->nworkers;
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+ node->shared_info);
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ /* If there's any instrumentation space, clear it for next time */
+ if (node->shared_info != NULL)
+ {
+ memset(node->shared_info->sinfo, 0,
+ node->shared_info->num_workers * sizeof(IncrementalSortInfo));
+ }
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortInitializeWorker
+ *
+ * Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+ {
+ node->shared_info =
+ shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+ node->am_worker = true;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortRetrieveInstrumentation
+ *
+ * Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+ {
+ Size size;
+ SharedIncrementalSortInfo *si;
+
+ if (node->shared_info == NULL)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+ si = palloc(size);
+ memcpy(si, node->shared_info, size);
+ node->shared_info = si;
+ }
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 73aa371..ef3587c
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(PlanState *pstate)
*** 93,99 ****
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
--- 93,100 ----
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess,
! false);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index d9ff8a7..417a8d2
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 919,924 ****
--- 919,942 ----
/*
+ * CopySortFields
+ *
+ * This function copies the fields of the Sort node. It is used by
+ * all the copy functions for classes which inherit from Sort.
+ */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+ COPY_SCALAR_FIELD(numCols);
+ COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+
+ /*
* _copySort
*/
static Sort *
*************** _copySort(const Sort *from)
*** 929,941 ****
/*
* copy node superclass fields
*/
! CopyPlanFields((const Plan *) from, (Plan *) newnode);
! COPY_SCALAR_FIELD(numCols);
! COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
return newnode;
}
--- 947,975 ----
/*
* copy node superclass fields
*/
! CopySortFields(from, newnode);
! return newnode;
! }
!
!
! /*
! * _copyIncrementalSort
! */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! IncrementalSort *newnode = makeNode(IncrementalSort);
!
! /*
! * copy node superclass fields
! */
! CopySortFields((const Sort *) from, (Sort *) newnode);
!
! /*
! * copy remainder of node
! */
! COPY_SCALAR_FIELD(skipCols);
return newnode;
}
*************** copyObjectImpl(const void *from)
*** 4803,4808 ****
--- 4837,4845 ----
case T_Sort:
retval = _copySort(from);
break;
+ case T_IncrementalSort:
+ retval = _copyIncrementalSort(from);
+ break;
case T_Group:
retval = _copyGroup(from);
break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index c97ee24..6cb9300
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 869,880 ****
}
static void
! _outSort(StringInfo str, const Sort *node)
{
int i;
- WRITE_NODE_TYPE("SORT");
-
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
--- 869,878 ----
}
static void
! _outSortInfo(StringInfo str, const Sort *node)
{
int i;
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 897,902 ****
--- 895,918 ----
}
static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ WRITE_NODE_TYPE("SORT");
+
+ _outSortInfo(str, node);
+ }
+
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ WRITE_NODE_TYPE("INCREMENTALSORT");
+
+ _outSortInfo(str, (const Sort *) node);
+
+ WRITE_INT_FIELD(skipCols);
+ }
+
+ static void
_outUnique(StringInfo str, const Unique *node)
{
int i;
*************** outNode(StringInfo str, const void *obj)
*** 3737,3742 ****
--- 3753,3761 ----
case T_Sort:
_outSort(str, obj);
break;
+ case T_IncrementalSort:
+ _outIncrementalSort(str, obj);
+ break;
case T_Unique:
_outUnique(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index 7eb67fc..f2b0e75
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 2059,2070 ****
}
/*
! * _readSort
*/
! static Sort *
! _readSort(void)
{
! READ_LOCALS(Sort);
ReadCommonPlan(&local_node->plan);
--- 2059,2071 ----
}
/*
! * ReadCommonSort
! * Assign the basic stuff of all nodes that inherit from Sort
*/
! static void
! ReadCommonSort(Sort *local_node)
{
! READ_TEMP_LOCALS();
ReadCommonPlan(&local_node->plan);
*************** _readSort(void)
*** 2073,2078 ****
--- 2074,2105 ----
READ_OID_ARRAY(sortOperators, local_node->numCols);
READ_OID_ARRAY(collations, local_node->numCols);
READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+
+ /*
+ * _readSort
+ */
+ static Sort *
+ _readSort(void)
+ {
+ READ_LOCALS_NO_FIELDS(Sort);
+
+ ReadCommonSort(local_node);
+
+ READ_DONE();
+ }
+
+ /*
+ * _readIncrementalSort
+ */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ READ_LOCALS(IncrementalSort);
+
+ ReadCommonSort(&local_node->sort);
+
+ READ_INT_FIELD(skipCols);
READ_DONE();
}
*************** parseNodeString(void)
*** 2634,2639 ****
--- 2661,2668 ----
return_value = _readMaterial();
else if (MATCH("SORT", 4))
return_value = _readSort();
+ else if (MATCH("INCREMENTALSORT", 15))
+ return_value = _readIncrementalSort();
else if (MATCH("GROUP", 5))
return_value = _readGroup();
else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index 906d08a..28f2b74
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3459,3464 ****
--- 3459,3468 ----
ptype = "Sort";
subpath = ((SortPath *) path)->subpath;
break;
+ case T_IncrementalSortPath:
+ ptype = "IncrementalSort";
+ subpath = ((SortPath *) path)->subpath;
+ break;
case T_GroupPath:
ptype = "Group";
subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index d11bf19..2f7cf60
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
+ bool enable_incrementalsort = true;
bool enable_hashagg = true;
bool enable_nestloop = true;
bool enable_material = true;
*************** cost_recursive_union(Path *runion, Path
*** 1601,1606 ****
--- 1602,1614 ----
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
+ * The sort can be either a full sort of the relation or an incremental
+ * sort when we already have data presorted by some of the required
+ * pathkeys. In the latter case we estimate the number of groups into
+ * which the presorted pathkeys divide the source data, and then estimate
+ * the cost of sorting each individual group, assuming the data is
+ * distributed uniformly across the groups. Also, if a LIMIT is
+ * specified, then we only have to pull from the source and sort some of
+ * the groups.
+ *
* If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path
*** 1627,1633 ****
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
--- 1635,1643 ----
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'presorted_keys' is the number of pathkeys already presorted in the given path
! * 'input_startup_cost' is the startup cost for reading the input data
! * 'input_total_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path
*** 1643,1661 ****
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_cost;
! Cost run_cost = 0;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
path->rows = tuples;
--- 1653,1680 ----
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_startup_cost;
! Cost run_cost = 0,
! rest_cost,
! group_cost,
! input_run_cost = input_total_cost - input_startup_cost;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
+ double num_groups,
+ group_input_bytes,
+ group_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
+ if (!enable_incrementalsort)
+ presorted_keys = 0;
path->rows = tuples;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1681,1693 ****
output_bytes = input_bytes;
}
! if (output_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(input_bytes / BLCKSZ);
! double nruns = input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
--- 1700,1749 ----
output_bytes = input_bytes;
}
! /*
! * Estimate the number of groups into which the presorted keys divide the dataset.
! */
! if (presorted_keys > 0)
! {
! List *presortedExprs = NIL;
! ListCell *l;
! int i = 0;
!
! /* Extract presorted keys as list of expressions */
! foreach(l, pathkeys)
! {
! PathKey *key = (PathKey *)lfirst(l);
! EquivalenceMember *member = (EquivalenceMember *)
! linitial(key->pk_eclass->ec_members);
!
! presortedExprs = lappend(presortedExprs, member->em_expr);
!
! i++;
! if (i >= presorted_keys)
! break;
! }
!
! /* Estimate number of groups with equal presorted keys */
! num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! }
! else
! {
! num_groups = 1.0;
! }
!
! /*
! * Estimate the average cost of sorting one group where the presorted
! * keys are equal.
! */
! group_input_bytes = input_bytes / num_groups;
! group_tuples = tuples / num_groups;
! if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(group_input_bytes / BLCKSZ);
! double nruns = group_input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1697,1703 ****
*
* Assume about N log2 N comparisons
*/
! startup_cost += comparison_cost * tuples * LOG2(tuples);
/* Disk costs */
--- 1753,1759 ----
*
* Assume about N log2 N comparisons
*/
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
/* Disk costs */
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1708,1717 ****
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! startup_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1764,1773 ----
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! group_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1719,1732 ****
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
}
else
{
! /* We'll use plain quicksort on all the input tuples */
! startup_cost += comparison_cost * tuples * LOG2(tuples);
}
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
--- 1775,1807 ----
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
}
else
{
! /*
! * We'll use plain quicksort on all the input tuples. If we expect
! * fewer than two tuples per sort group, then assume the logarithmic
! * part of the estimate to be 1.
! */
! if (group_tuples >= 2.0)
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
! else
! group_cost = comparison_cost * group_tuples;
}
+ /* Add the per-group cost of fetching tuples from the input */
+ group_cost += input_run_cost / num_groups;
+
+ /*
+ * We have to sort the first group before the node can start returning
+ * tuples. Sorting the rest of the groups is required to return all the
+ * other tuples.
+ */
+ startup_cost += group_cost;
+ rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ if (rest_cost > 0.0)
+ run_cost += rest_cost;
+
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1737,1742 ****
--- 1812,1830 ----
*/
run_cost += cpu_operator_cost * tuples;
+ /* Extra costs of incremental sort */
+ if (presorted_keys > 0)
+ {
+ /*
+ * In the incremental sort case we also have to account for the cost of
+ * detecting sort groups. This amounts to an extra copy and comparison
+ * for each tuple.
+ */
+ run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+ /* Cost of the per-group tuplesort reset */
+ run_cost += 2.0 * cpu_tuple_cost * num_groups;
+ }
+
path->startup_cost = startup_cost;
path->total_cost = startup_cost + run_cost;
}
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2544,2549 ****
--- 2632,2639 ----
cost_sort(&sort_path,
root,
outersortkeys,
+ pathkeys_common(outer_path->pathkeys, outersortkeys),
+ outer_path->startup_cost,
outer_path->total_cost,
outer_path_rows,
outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2570,2575 ****
--- 2660,2667 ----
cost_sort(&sort_path,
root,
innersortkeys,
+ pathkeys_common(inner_path->pathkeys, innersortkeys),
+ inner_path->startup_cost,
inner_path->total_cost,
inner_path_rows,
inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index c6870d3..b97f22a
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
#include "nodes/nodeFuncs.h"
#include "nodes/plannodes.h"
#include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/tlist.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
return PATHKEYS_EQUAL;
}
+
+ /*
+ * pathkeys_common
+ * Returns length of longest common prefix of keys1 and keys2.
+ */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ int n;
+ ListCell *key1,
+ *key2;
+ n = 0;
+
+ forboth(key1, keys1, key2, keys2)
+ {
+ PathKey *pathkey1 = (PathKey *) lfirst(key1);
+ PathKey *pathkey2 = (PathKey *) lfirst(key2);
+
+ if (pathkey1 != pathkey2)
+ return n;
+ n++;
+ }
+
+ return n;
+ }
+
+
/*
* pathkeys_contained_in
* Common special case of compare_pathkeys: we just want to know
*************** right_merge_direction(PlannerInfo *root,
*** 1488,1513 ****
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
! * no good to order by just the first key(s) of the requested ordering.
! * So the result is always either 0 or list_length(root->query_pathkeys).
*/
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
{
! if (root->query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
{
! /* It's useful ... or at least the first N keys are */
! return list_length(root->query_pathkeys);
}
-
- return 0; /* path ordering not useful */
}
/*
--- 1517,1558 ----
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Returns the number of pathkeys that match the given query pathkeys.
! * The remaining ones can be satisfied by incremental sort.
*/
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
{
! int n_common_pathkeys;
!
! if (query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
!
! if (enable_incrementalsort)
{
! /*
! * Return the number of pathkeys in common, or 0 if there are none.
! * Any common prefix of the pathkeys is useful for ordering, because
! * the remaining keys can be satisfied by incremental sort.
! */
! return n_common_pathkeys;
! }
! else
! {
! /*
! * When incremental sort is disabled, pathkeys are useful only when
! * they contain all the query pathkeys.
! */
! if (n_common_pathkeys == list_length(query_pathkeys))
! return n_common_pathkeys;
! else
! return 0;
}
}
/*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1523,1529 ****
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
--- 1568,1574 ----
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index d445477..b080fa6
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 235,241 ****
Plan *lefttree, Plan *righttree,
JoinType jointype, bool inner_unique,
bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 235,241 ----
Plan *lefttree, Plan *righttree,
JoinType jointype, bool inner_unique,
bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static EquivalenceMember *find_ec_member
*** 251,260 ****
TargetEntry *tle,
Relids relids);
static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! Relids relids);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 251,261 ----
TargetEntry *tle,
Relids relids);
static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! Relids relids, int skipCols);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 436,441 ****
--- 437,443 ----
(GatherPath *) best_path);
break;
case T_Sort:
+ case T_IncrementalSort:
plan = (Plan *) create_sort_plan(root,
(SortPath *) best_path,
flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1120,1125 ****
--- 1122,1128 ----
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int n_common_pathkeys;
/* Build the child plan */
/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1154,1162 ****
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
--- 1157,1167 ----
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
+ n_common_pathkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1506,1511 ****
--- 1511,1517 ----
Plan *subplan;
List *pathkeys = best_path->path.pathkeys;
List *tlist = build_path_tlist(root, &best_path->path);
+ int n_common_pathkeys;
/* As with Gather, it's best to project away columns in the workers. */
subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1535,1546 ****
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
--- 1541,1556 ----
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
! {
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ n_common_pathkeys,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
+ }
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1653,1658 ****
--- 1663,1669 ----
{
Sort *plan;
Plan *subplan;
+ int n_common_pathkeys;
/*
* We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1662,1668 ****
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
copy_generic_path_info(&plan->plan, (Path *) best_path);
--- 1673,1685 ----
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! if (IsA(best_path, IncrementalSortPath))
! n_common_pathkeys = ((IncrementalSortPath *) best_path)->skipCols;
! else
! n_common_pathkeys = 0;
!
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! NULL, n_common_pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1906,1912 ****
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
! subplan);
}
if (!rollup->is_hashed)
--- 1923,1930 ----
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
! subplan,
! 0);
}
if (!rollup->is_hashed)
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3846,3855 ****
*/
if (best_path->outersortkeys)
{
Relids outer_relids = outer_path->parent->relids;
! Sort *sort = make_sort_from_pathkeys(outer_plan,
! best_path->outersortkeys,
! outer_relids);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
--- 3864,3878 ----
*/
if (best_path->outersortkeys)
{
+ Sort *sort;
+ int n_common_pathkeys;
Relids outer_relids = outer_path->parent->relids;
!
! n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! best_path->jpath.outerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! outer_relids, n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3860,3869 ****
if (best_path->innersortkeys)
{
Relids inner_relids = inner_path->parent->relids;
! Sort *sort = make_sort_from_pathkeys(inner_plan,
! best_path->innersortkeys,
! inner_relids);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
--- 3883,3897 ----
if (best_path->innersortkeys)
{
+ Sort *sort;
+ int n_common_pathkeys;
Relids inner_relids = inner_path->parent->relids;
!
! n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! best_path->jpath.innerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! inner_relids, n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4914,4921 ****
{
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
--- 4942,4954 ----
{
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
+ int skip_cols = 0;
! if (IsA(plan, IncrementalSort))
! skip_cols = ((IncrementalSort *) plan)->skipCols;
!
! cost_sort(&sort_path, root, NIL, skip_cols,
! lefttree->startup_cost,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5504,5516 ****
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node = makeNode(Sort);
! Plan *plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
--- 5537,5567 ----
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node;
! Plan *plan;
!
! /* Always use a regular Sort node when enable_incrementalsort = false */
! if (!enable_incrementalsort)
! skipCols = 0;
+ if (skipCols == 0)
+ {
+ node = makeNode(Sort);
+ }
+ else
+ {
+ IncrementalSort *incrementalSort;
+
+ incrementalSort = makeNode(IncrementalSort);
+ node = &incrementalSort->sort;
+ incrementalSort->skipCols = skipCols;
+ }
+
+ plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass
*** 5843,5851 ****
* 'lefttree' is the node which yields input tuples
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
* 'relids' is the set of relations required by prepare_sort_from_pathkeys()
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
{
int numsortkeys;
AttrNumber *sortColIdx;
--- 5894,5904 ----
* 'lefttree' is the node which yields input tuples
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
* 'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ * 'skipCols' is the number of presorted columns in input tuples
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! Relids relids, int skipCols)
{
int numsortkeys;
AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree,
*** 5865,5871 ****
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5918,5924 ----
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5908,5914 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5961,5967 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, 0,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5929,5935 ****
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
--- 5982,5989 ----
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5962,5968 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 6016,6022 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** is_projection_capable_plan(Plan *plan)
*** 6619,6624 ****
--- 6673,6679 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index 889e8af..49af1f1
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
#include "parser/parse_clause.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
#include "utils/syscache.h"
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index f6b8bbf..a7955e5
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3852,3865 ****
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_partial_path || is_sorted)
{
/* Sort the cheapest partial path, if it isn't already */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3852,3865 ----
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_partial_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest partial path, if it isn't already */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3932,3945 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_path || is_sorted)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3932,3945 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_ordered_paths(PlannerInfo *root,
*** 5006,5018 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || is_sorted)
{
! if (!is_sorted)
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
--- 5006,5018 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || n_useful_pathkeys > 0)
{
! if (n_useful_pathkeys < list_length(root->sort_pathkeys))
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid
*** 6142,6149 ****
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL,
! seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
--- 6142,6150 ----
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL, 0,
! seqScanPath->startup_cost, seqScanPath->total_cost,
! rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index 28a7f7e..90df9cc
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 642,647 ****
--- 642,648 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 2e3abee..0ee6812
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2782,2787 ****
--- 2782,2788 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index f620243..c83161f
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 988,994 ****
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 988,995 ----
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, 0,
! sorted_p.startup_cost, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 68dee0f..1c2b815
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 103,109 ****
}
/*
! * compare_path_fractional_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
--- 103,109 ----
}
/*
! * compare_fractional_path_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1304,1315 ****
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1304,1316 ----
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ int n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1323,1328 ****
--- 1324,1331 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->parent->tuples,
subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1564,1570 ****
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
--- 1567,1574 ----
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL, 0,
! subpath->startup_cost,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
*************** create_gather_merge_path(PlannerInfo *ro
*** 1657,1662 ****
--- 1661,1667 ----
GatherMergePath *pathnode = makeNode(GatherMergePath);
Cost input_startup_cost = 0;
Cost input_total_cost = 0;
+ int n_common_pathkeys;
Assert(subpath->parallel_safe);
Assert(pathkeys);
*************** create_gather_merge_path(PlannerInfo *ro
*** 1673,1679 ****
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1678,1686 ----
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
!
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_gather_merge_path(PlannerInfo *ro
*** 1687,1692 ****
--- 1694,1701 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2543,2551 ****
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode = makeNode(SortPath);
- pathnode->path.pathtype = T_Sort;
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
--- 2552,2582 ----
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode;
! int n_common_pathkeys;
!
! if (enable_incrementalsort)
! n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! else
! n_common_pathkeys = 0;
!
! if (n_common_pathkeys == 0)
! {
! pathnode = makeNode(SortPath);
! pathnode->path.pathtype = T_Sort;
! }
! else
! {
! IncrementalSortPath *incpathnode;
!
! incpathnode = makeNode(IncrementalSortPath);
! pathnode = &incpathnode->spath;
! pathnode->path.pathtype = T_IncrementalSort;
! incpathnode->skipCols = n_common_pathkeys;
! }
!
! Assert(n_common_pathkeys < list_length(pathkeys));
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2559,2565 ****
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root, pathkeys,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
--- 2590,2598 ----
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root,
! pathkeys, n_common_pathkeys,
! subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2871,2877 ****
else
{
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL,
0.0,
subpath->rows,
subpath->pathtarget->width,
--- 2904,2911 ----
else
{
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL, 0,
! 0.0,
0.0,
subpath->rows,
subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index 1e323d9..8f01f05
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 291,297 ****
qstate->sortCollations,
qstate->sortNullsFirsts,
work_mem,
! qstate->rescan_needed);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
--- 291,298 ----
qstate->sortCollations,
qstate->sortNullsFirsts,
work_mem,
! qstate->rescan_needed,
! false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index 4bbb4a8..d9c3243
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3650,3655 ****
--- 3650,3691 ----
}
/*
+ * estimate_pathkeys_groups - Estimate the number of groups the dataset is
+ * divided into by the given pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * the first i pathkeys divide the dataset into.  It is effectively a
+ * convenience wrapper over estimate_num_groups().
+ */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ ListCell *l;
+ List *groupExprs = NIL;
+ double *result;
+ int i;
+
+ /*
+ * Get the number of groups for each prefix of the pathkeys.
+ */
+ i = 0;
+ result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ groupExprs = lappend(groupExprs, member->em_expr);
+
+ result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ i++;
+ }
+
+ return result;
+ }
+
+ /*
* Estimate hash bucket statistics when the specified expression is used
* as a hash key for the given number of buckets.
*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 6dcd738..192d3c8
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 858,863 ****
--- 858,872 ----
NULL, NULL, NULL
},
{
+ {"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of incremental sort steps."),
+ NULL
+ },
+ &enable_incrementalsort,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of hashed aggregation plans."),
NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 34af8d6..a92b477
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 231,236 ****
--- 231,243 ----
int64 allowedMem; /* total memory allowed, in bytes */
int maxTapes; /* number of tapes (Knuth's T) */
int tapeRange; /* maxTapes-1 (Knuth's P) */
+ int64 maxSpace; /* maximum amount of space occupied among sorts
+ of groups, either in-memory or on-disk */
+ bool maxSpaceOnDisk; /* true when maxSpace is the value for on-disk
+ space, false when it's the value for
+ in-memory space */
+ TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ MemoryContext maincontext;
MemoryContext sortcontext; /* memory context holding most sort data */
MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 573,578 ****
--- 580,588 ----
static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+
/*
* Special versions of qsort just for SortTuple objects. qsort_tuple() sorts
*************** static Tuplesortstate *
*** 607,625 ****
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Create a working memory context for this sort operation. All data
! * needed by the sort will live inside this context.
*/
! sortcontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
--- 617,646 ----
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
+ MemoryContext maincontext;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Memory context surviving tuplesort_reset.  This memory context holds
! * data that is useful to keep while sorting multiple similar batches.
*/
! maincontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
+ * Create a working memory context for one sort operation. The content of
+ * this context is deleted by tuplesort_reset.
+ */
+ sortcontext = AllocSetContextCreate(maincontext,
+ "TupleSort sort",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+
+ /*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 636,642 ****
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(sortcontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
--- 657,663 ----
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(maincontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
*************** tuplesort_begin_common(int workMem, bool
*** 654,659 ****
--- 675,681 ----
state->availMem = state->allowedMem;
state->sortcontext = sortcontext;
state->tuplecontext = tuplecontext;
+ state->maincontext = maincontext;
state->tapeset = NULL;
state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 694,706 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
AssertArg(nkeys > 0);
--- 716,729 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
AssertArg(nkeys > 0);
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 742,748 ****
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0);
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
--- 765,771 ----
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0) && !skipAbbrev;
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 773,779 ****
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 796,802 ----
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 864,870 ****
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 887,893 ----
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 939,945 ****
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 962,968 ----
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 981,987 ****
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 1004,1010 ----
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1092,1107 ****
}
/*
! * tuplesort_end
! *
! * Release resources and clean up.
*
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
*/
! void
! tuplesort_end(Tuplesortstate *state)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1115,1126 ----
}
/*
! * tuplesort_free
*
! * Internal routine for freeing resources of tuplesort.
*/
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1160,1166 ****
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! MemoryContextDelete(state->sortcontext);
}
/*
--- 1179,1276 ----
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! if (delete)
! {
! MemoryContextDelete(state->maincontext);
! }
! else
! {
! MemoryContextResetOnly(state->sortcontext);
! MemoryContextResetOnly(state->tuplecontext);
! }
! }
!
! /*
! * tuplesort_end
! *
! * Release resources and clean up.
! *
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
! */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! tuplesort_free(state, true);
! }
!
! /*
! * tuplesort_updatemax
! *
! * Update maximum resource usage statistics.
! */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! int64 spaceUsed;
! bool spaceUsedOnDisk;
!
! /*
! * Note: it might seem we should provide both memory and disk usage for a
! * disk-based sort. However, the current code doesn't track memory space
! * accurately once we have begun to return tuples to the caller (since we
! * don't account for pfree's the caller is expected to do), so we cannot
! * rely on availMem in a disk sort. This does not seem worth the overhead
! * to fix. Is it worth creating an API for the memory context code to
! * tell us how much is actually used in sortcontext?
! */
! if (state->tapeset)
! {
! spaceUsedOnDisk = true;
! spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! }
! else
! {
! spaceUsedOnDisk = false;
! spaceUsed = state->allowedMem - state->availMem;
! }
!
! if (spaceUsed > state->maxSpace)
! {
! state->maxSpace = spaceUsed;
! state->maxSpaceOnDisk = spaceUsedOnDisk;
! state->maxSpaceStatus = state->status;
! }
! }
!
! /*
! * tuplesort_reset
! *
! * Reset the tuplesort. Reset all the data in the tuplesort, but leave the
! * meta-information in. After tuplesort_reset, tuplesort is ready to start
! * a new sort. This allows avoiding recreation of the tuple sort (and saves resources)
! * when sorting multiple small batches.
! */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! tuplesort_updatemax(state);
! tuplesort_free(state, false);
! state->status = TSS_INITIAL;
! state->memtupcount = 0;
! state->boundUsed = false;
! state->tapeset = NULL;
! state->currentRun = 0;
! state->result_tape = -1;
! state->bounded = false;
! state->availMem = state->allowedMem;
! state->lastReturnedTuple = NULL;
! state->slabAllocatorUsed = false;
! state->slabMemoryBegin = NULL;
! state->slabMemoryEnd = NULL;
! state->slabFreeHead = NULL;
! USEMEM(state, GetMemoryChunkSpace(state->memtuples));
}
/*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 2950,2967 ****
* to fix. Is it worth creating an API for the memory context code to
* tell us how much is actually used in sortcontext?
*/
! if (state->tapeset)
! {
stats->spaceType = SORT_SPACE_TYPE_DISK;
- stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- }
else
- {
stats->spaceType = SORT_SPACE_TYPE_MEMORY;
! stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! }
! switch (state->status)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
--- 3060,3074 ----
* to fix. Is it worth creating an API for the memory context code to
* tell us how much is actually used in sortcontext?
*/
! tuplesort_updatemax(state);
!
! if (state->maxSpaceOnDisk)
stats->spaceType = SORT_SPACE_TYPE_DISK;
else
stats->spaceType = SORT_SPACE_TYPE_MEMORY;
! stats->spaceUsed = (state->maxSpace + 1023) / 1024;
! switch (state->maxSpaceStatus)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...b2e4e50
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,31 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+
+ #include "access/parallel.h"
+ #include "nodes/execnodes.h"
+
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+ /* parallel instrumentation support */
+ extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+ extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+ #endif /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index e05bc04..ff019c5
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1743,1748 ****
--- 1743,1762 ----
Tuplestorestate *tuplestorestate;
} MaterialState;
+
+ /* ----------------
+ * When performing sorting by multiple keys input dataset could be already
+ * presorted by some prefix of these keys. We call them "skip keys".
+ * SkipKeyData represents information about one such key.
+ * ----------------
+ */
+ typedef struct SkipKeyData
+ {
+ FmgrInfo flinfo; /* comparison function info */
+ FunctionCallInfoData fcinfo; /* comparison function call info */
+ OffsetNumber attno; /* attribute number in tuple */
+ } SkipKeyData;
+
/* ----------------
* Shared memory container for per-worker sort information
* ----------------
*************** typedef struct SortState
*** 1771,1776 ****
--- 1785,1828 ----
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
+ /* ----------------
+ * Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+ typedef struct IncrementalSortInfo
+ {
+ TuplesortInstrumentation sinstrument;
+ int64 groupsCount;
+ } IncrementalSortInfo;
+
+ typedef struct SharedIncrementalSortInfo
+ {
+ int num_workers;
+ IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+ } SharedIncrementalSortInfo;
+
+ /* ----------------
+ * IncrementalSortState information
+ * ----------------
+ */
+ typedef struct IncrementalSortState
+ {
+ ScanState ss; /* its first field is NodeTag */
+ bool bounded; /* is the result set bounded? */
+ int64 bound; /* if bounded, how many tuples are needed */
+ bool sort_Done; /* sort completed yet? */
+ bool finished; /* fetching tuples from outer node
+ is finished ? */
+ bool bounded_Done; /* value of bounded we did the sort with */
+ int64 bound_Done; /* value of bound we did the sort with */
+ void *tuplesortstate; /* private state of tuplesort.c */
+ SkipKeyData *skipKeys; /* keys, dataset is presorted by */
+ int64 groupsCount; /* number of groups with equal skip keys */
+ TupleTableSlot *sampleSlot; /* slot for sample tuple of sort group */
+ bool am_worker; /* are we a worker? */
+ SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+ } IncrementalSortState;
+
/* ---------------------
* GroupState information
* ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index ffeeb49..4b78045
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 73,78 ****
--- 73,79 ----
T_HashJoin,
T_Material,
T_Sort,
+ T_IncrementalSort,
T_Group,
T_Agg,
T_WindowAgg,
*************** typedef enum NodeTag
*** 125,130 ****
--- 126,132 ----
T_HashJoinState,
T_MaterialState,
T_SortState,
+ T_IncrementalSortState,
T_GroupState,
T_AggState,
T_WindowAggState,
*************** typedef enum NodeTag
*** 240,245 ****
--- 242,248 ----
T_ProjectionPath,
T_ProjectSetPath,
T_SortPath,
+ T_IncrementalSortPath,
T_GroupPath,
T_UpperUniquePath,
T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index 9b38d44..0694fb2
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 749,754 ****
--- 749,765 ----
bool *nullsFirst; /* NULLS FIRST/LAST directions */
} Sort;
+
+ /* ----------------
+ * incremental sort node
+ * ----------------
+ */
+ typedef struct IncrementalSort
+ {
+ Sort sort;
+ int skipCols; /* number of presorted columns */
+ } IncrementalSort;
+
/* ---------------
* group node -
* Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index 9e68e65..f0a37e5
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1507,1512 ****
--- 1507,1522 ----
} SortPath;
/*
+ * IncrementalSortPath
+ */
+ typedef struct IncrementalSortPath
+ {
+ SortPath spath;
+ int skipCols;
+ } IncrementalSortPath;
+
+
+ /*
* GroupPath represents grouping (of presorted input)
*
* groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 6c2317d..138d951
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
extern bool enable_bitmapscan;
extern bool enable_tidscan;
extern bool enable_sort;
+ extern bool enable_incrementalsort;
extern bool enable_hashagg;
extern bool enable_nestloop;
extern bool enable_material;
*************** extern void cost_namedtuplestorescan(Pat
*** 103,110 ****
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
--- 104,112 ----
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index ea886b6..b4370e2
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 188,193 ****
--- 188,194 ----
extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
Relids required_outer,
CostSelector cost_criterion,
*************** extern List *select_outer_pathkeys_for_m
*** 226,231 ****
--- 227,233 ----
extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
List *mergeclauses,
List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
extern List *truncate_useless_pathkeys(PlannerInfo *root,
RelOptInfo *rel,
List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 199a631..41b7196
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 206,211 ****
--- 206,214 ----
extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows, List **pgset);
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ double tuples);
+
extern void estimate_hash_bucket_stats(PlannerInfo *root,
Node *hashkey, double nbuckets,
Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index b6b8c8e..938d329
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 90,96 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
--- 90,97 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 134,139 ****
--- 135,142 ----
extern void tuplesort_end(Tuplesortstate *state);
+ extern void tuplesort_reset(Tuplesortstate *state);
+
extern void tuplesort_get_stats(Tuplesortstate *state,
TuplesortInstrumentation *stats);
extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort
*** 19,27 ****
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Sort
Sort Key: id, data
! -> Seq Scan on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
--- 19,28 ----
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Incremental Sort
Sort Key: id, data
! Presorted Key: id
! -> Index Scan using test_dc_pkey on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index c698faf..fec6a4e
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE: drop cascades to table matest1
*** 1515,1520 ****
--- 1515,1521 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1655,1663 ****
--- 1656,1700 ----
{3,7,8,10,13,13,16,18,19,22}
(3 rows)
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ QUERY PLAN
+ -------------------------------------------------------------------------
+ Merge Append
+ Sort Key: tenk1.thousand, tenk1.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1
+ -> Incremental Sort
+ Sort Key: tenk1_1.thousand, tenk1_1.thousand
+ Presorted Key: tenk1_1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ QUERY PLAN
+ -------------------------------------------------------------
+ Merge Append
+ Sort Key: a.thousand, a.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 a
+ -> Incremental Sort
+ Sort Key: b.unique2, b.unique2
+ Presorted Key: b.unique2
+ -> Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index cd1f7f3..5acfbbb
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select name, setting from pg_settings wh
*** 76,81 ****
--- 76,82 ----
enable_gathermerge | on
enable_hashagg | on
enable_hashjoin | on
+ enable_incrementalsort | on
enable_indexonlyscan | on
enable_indexscan | on
enable_material | on
*************** select name, setting from pg_settings wh
*** 85,91 ****
enable_seqscan | on
enable_sort | on
enable_tidscan | on
! (13 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
--- 86,92 ----
enable_seqscan | on
enable_sort | on
enable_tidscan | on
! (14 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index 169d0dc..558246b
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 544,549 ****
--- 544,550 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
*************** SELECT
*** 605,613 ****
--- 606,631 ----
ORDER BY f.i LIMIT 10)
FROM generate_series(1, 3) g(i);
+ set enable_incrementalsort = on;
+
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
Hi!
On Mon, Nov 20, 2017 at 12:24 AM, Thomas Munro <
thomas.munro@enterprisedb.com> wrote:
On Wed, Nov 15, 2017 at 7:42 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
Sure, please find rebased patch attached.
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+                 TupleTableSlot *b)
+ {
+     int n, i;
+
+     Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+     n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+     for (i = 0; i < n; i++)
+     {
+         Datum datumA, datumB, result;
+         bool isnullA, isnullB;
+         AttrNumber attno = node->skipKeys[i].attno;
+         SkipKeyData *key;
+
+         datumA = slot_getattr(a, attno, &isnullA);
+         datumB = slot_getattr(b, attno, &isnullB);
+
+         /* Special case for NULL-vs-NULL, else use standard comparison */
+         if (isnullA || isnullB)
+         {
+             if (isnullA == isnullB)
+                 continue;
+             else
+                 return false;
+         }
+
+         key = &node->skipKeys[i];
+
+         key->fcinfo.arg[0] = datumA;
+         key->fcinfo.arg[1] = datumB;
+
+         /* just for paranoia's sake, we reset isnull each time */
+         key->fcinfo.isnull = false;
+
+         result = FunctionCallInvoke(&key->fcinfo);
+
+         /* Check for null result, since caller is clearly not expecting one */
+         if (key->fcinfo.isnull)
+             elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+         if (!DatumGetBool(result))
+             return false;
+     }
+     return true;
+ }

Is there some reason not to use ApplySortComparator for this? I think
you're missing out on lower-overhead comparators, and in any case it's
probably good code reuse, no?
However, for the incremental sort case we don't need to know here whether
A > B or B > A. It's enough for us to know whether A = B or A != B. In some
cases that's way cheaper. For instance, for texts an equality check is
basically memcmp, while a full comparison may use the collation.
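That distinction can be sketched in plain C (hypothetical stand-in functions, not the patch's code; strcoll plays the role of a collation-aware comparator):

```c
#include <string.h>

/*
 * Hypothetical stand-ins for illustration.  Group-boundary detection only
 * needs equality, which for strings can short-circuit on length and then
 * use a raw byte comparison.
 */
static int
texts_equal(const char *a, size_t alen, const char *b, size_t blen)
{
    if (alen != blen)
        return 0;                   /* different lengths cannot be equal */
    return memcmp(a, b, alen) == 0; /* cheap byte-wise check */
}

/* A full ordering comparison must consult the collation (strcoll here). */
static int
texts_compare(const char *a, const char *b)
{
    return strcoll(a, b);
}
```

The equality path never has to interpret collation rules, which is the saving Alexander describes.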
Embarrassingly, I was unaware of this patch and started prototyping
exactly the same thing independently[1]. I hadn't got very far and
will now abandon that, but that's one thing I did differently. Two
other things that may be different: I had a special case for groups of
size 1 that skipped the sorting, and I only sorted on the suffix
because I didn't put tuples with different prefixes into the sorter (I
was assuming that tuplesort_reset was going to be super efficient,
though I hadn't got around to writing that). I gather that you have
determined empirically that it's better to be able to sort groups of
at least MIN_GROUP_SIZE than to be able to skip the comparisons on the
leading attributes, but why is that the case?
Right. The issue is that not only groups of one tuple cause overhead;
small groups (like 2 or 3 tuples) do as well. Also, the overhead is not
related only to sorting. While investigating the regression case provided
by Heikki [1], I saw extra time spent mostly in the extra copying of the
sample tuple and the comparison with it. To cope with this overhead I've
introduced MIN_GROUP_SIZE, which avoids copying sample tuples too
frequently.
[1]: /messages/by-id/2c59b009-61d3-9350-04ee-4b701eb93101@iki.fi
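A simplified model of that batching (hypothetical plain C; the constant, every name, and the use of plain ints for prefix values are invented for illustration):

```c
#include <stddef.h>

#define MIN_GROUP_SIZE 32       /* invented value for illustration */

/*
 * Count how many sort batches an incremental sort would run over a stream
 * of presorted prefix values: absorb at least MIN_GROUP_SIZE rows before
 * consulting the sample prefix, then keep absorbing while the prefix stays
 * equal.  This amortizes the sample-tuple copy and comparison over larger
 * batches, which is the overhead described above.
 */
static int
count_batches(const int *prefix, size_t n)
{
    size_t  i = 0;
    int     batches = 0;

    while (i < n)
    {
        size_t  start = i;
        int     sample;

        /* unconditionally absorb up to MIN_GROUP_SIZE rows */
        while (i < n && i - start < MIN_GROUP_SIZE)
            i++;
        if (i >= n)
        {
            batches++;          /* final (possibly short) batch */
            break;
        }
        sample = prefix[i - 1]; /* remember the last prefix seen */
        /* keep absorbing while the presorted prefix stays equal */
        while (i < n && prefix[i] == sample)
            i++;
        batches++;              /* sort and emit this batch */
    }
    return batches;
}

/* Helper for experimentation: constant vs. all-distinct prefixes. */
static int
demo_batches(int distinct, size_t n)
{
    int     buf[256];
    size_t  i;

    for (i = 0; i < n; i++)
        buf[i] = distinct ? (int) i : 7;
    return count_batches(buf, n);
}
```

With all-equal prefixes the whole input lands in one batch, while fully distinct prefixes are still sorted in MIN_GROUP_SIZE chunks rather than one sort per row.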
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Nov 20, 2017 at 3:34 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
Thank you very much for the review. I really appreciate that this topic is
getting attention. Please find the next revision of the patch attached.
I would really like to see this get into v11. This is an important
patch, that has fallen through the cracks far too many times.
--
Peter Geoghegan
On Tue, Nov 21, 2017 at 1:00 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
On Mon, Nov 20, 2017 at 12:24 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Is there some reason not to use ApplySortComparator for this? I think
you're missing out on lower-overhead comparators, and in any case it's
probably good code reuse, no?
However, for incremental sort case we don't need to know here whether A > B
or B > A. It's enough for us to know if A = B or A != B. In some cases
it's way cheaper. For instance, for texts equality check is basically
memcmp while comparison may use collation.
Ah, right, of course.
I gather that you have
determined empirically that it's better to be able to sort groups of
at least MIN_GROUP_SIZE than to be able to skip the comparisons on the
leading attributes, but why is that the case?
Right. The issue is that not only groups of one tuple cause overhead;
small groups (like 2 or 3 tuples) do as well. Also, the overhead is not
related only to sorting. While investigating the regression case provided
by Heikki [1], I saw extra time spent mostly in the extra copying of the
sample tuple and the comparison with it. To cope with this overhead I've
introduced MIN_GROUP_SIZE, which avoids copying sample tuples too
frequently.
I see. I wonder if there could ever be a function like
ExecMoveTuple(dst, src). Given the polymorphism involved it'd be
slightly complicated and you'd probably have a general case that just
copies the tuple to dst and clears src, but there might be a bunch of
cases where you can do something more efficient like moving a pointer
and pin ownership. I haven't really thought that through and
there may be fundamental problems with it...
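A toy version of that idea might look like this (hypothetical sketch; the slot layout is invented and real executor slots are far more involved):

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical sketch of the ExecMoveTuple(dst, src) idea.  When both
 * slots share a representation we just transfer the pointer (and thereby
 * ownership); otherwise we fall back to copy-and-clear.
 */
typedef struct DemoSlot
{
    int     kind;               /* representation tag */
    void   *tuple;              /* owned tuple memory, or NULL */
    size_t  len;                /* size of the owned tuple */
} DemoSlot;

static void
demo_move_tuple(DemoSlot *dst, DemoSlot *src)
{
    free(dst->tuple);           /* drop whatever dst owned before */
    if (dst->kind == src->kind)
    {
        dst->tuple = src->tuple;        /* fast path: move the pointer */
    }
    else
    {
        dst->tuple = malloc(src->len);  /* general case: copy... */
        memcpy(dst->tuple, src->tuple, src->len);
        free(src->tuple);               /* ...then clear the source */
    }
    dst->len = src->len;
    src->tuple = NULL;
    src->len = 0;
}

/* Tiny self-check helper: returns 1 when the move behaved as expected. */
static int
demo_move_works(int same_kind)
{
    DemoSlot src = { 0, malloc(4), 4 };
    DemoSlot dst = { same_kind ? 0 : 1, NULL, 0 };

    memcpy(src.tuple, "abcd", 4);
    demo_move_tuple(&dst, &src);
    return src.tuple == NULL && dst.len == 4 &&
        memcmp(dst.tuple, "abcd", 4) == 0;
}
```

Either way the source slot ends up cleared; only the cost of the transfer differs, which is the efficiency Thomas is after.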
If you're going to push the tuples into the sorter every time, then I
guess there are some special cases that could allow future
optimisations: (1) if you noticed that every prefix was different, you
can skip the sort operation (that is, you can use the sorter as a dumb
tuplestore and just get the tuples out in the same order you put them
in; not sure if Tuplesort supports that but it presumably could), (2)
if you noticed that every prefix was the same (that is, you have only
one prefix/group in the sorter) then you could sort only on the suffix
(that is, you could somehow tell Tuplesort to ignore the leading
columns), (3) as a more complicated optimisation for intermediate
group sizes 1 < n < MIN_GROUP_SIZE, you could somehow number the
groups with an integer that increments whenever you see the prefix
change, and somehow tell tuplesort.c to use that instead of the
leading columns. Ok, that last one is probably hard but the first two
might be easier...
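Purely as a sketch of the dispatch Thomas describes (all names invented; none of this is in the patch):

```c
/*
 * Hypothetical strategy choice for one batch of an incremental sort,
 * based on how many distinct prefix groups the batch contains.
 */
typedef enum SortStrategy
{
    SORT_PASSTHROUGH,   /* every prefix distinct: input already ordered */
    SORT_SUFFIX_ONLY,   /* single prefix: sort on trailing keys only */
    SORT_FULL           /* mixed prefixes: full multi-key sort */
} SortStrategy;

static SortStrategy
choose_sort_strategy(long ngroups, long ntuples)
{
    if (ngroups == ntuples)
        return SORT_PASSTHROUGH;    /* case (1): skip the sort */
    if (ngroups == 1)
        return SORT_SUFFIX_ONLY;    /* case (2): ignore leading columns */
    return SORT_FULL;               /* intermediate sizes, incl. case (3) */
}
```

Case (3), numbering the groups and sorting on that synthetic column, would still take the SORT_FULL branch here but with a cheaper leading key.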
--
Thomas Munro
http://www.enterprisedb.com
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Antonin Houska <ah@cybertec.at> wrote:
* ExecIncrementalSort()
** if (node->tuplesortstate == NULL)
If both branches contain the expression
node->groupsCount++;
I suggest moving it outside the "if" construct.
Done.
One more comment on this: I wonder if the field isn't incremented too
early. It seems to me that the value can end up non-zero even if the input
set turns out to be empty (not sure if that can happen in practice).
And finally one question about regression tests: what's the purpose of the
changes in contrib/postgres_fdw/sql/postgres_fdw.sql ? I see no
IncrementalSort node in the output.
--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
On Mon, Mar 20, 2017 at 6:33 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
Thank you for the report.
Please, find rebased patch in the attachment.
This patch cannot be applied. Please provide a rebased version. I am
moving it to the next CF with "waiting on author" as its status.
--
Michael
On Wed, Nov 22, 2017 at 1:22 PM, Antonin Houska <ah@cybertec.at> wrote:
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Antonin Houska <ah@cybertec.at> wrote:
* ExecIncrementalSort()
** if (node->tuplesortstate == NULL)
If both branches contain the expression
node->groupsCount++;
I suggest moving it outside the "if" construct.
Done.
One more comment on this: I wonder if the field isn't incremented too
early. It seems to me that the value can end up non-zero even if the input
set turns out to be empty (not sure if that can happen in practice).
That does happen in practice: on an empty input set, incremental sort
counts exactly one group.
# create table t (x int, y int);
CREATE TABLE
# create index t_x_idx on t (x);
CREATE INDEX
# set enable_seqscan = off;
SET
# explain (analyze, buffers) select * from t order by x, y;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Incremental Sort (cost=0.74..161.14 rows=2260 width=8) (actual
time=0.024..0.024 rows=0 loops=1)
Sort Key: x, y
Presorted Key: x
Sort Method: quicksort Memory: 25kB
Sort Groups: 1
Buffers: shared hit=1
-> Index Scan using t_x_idx on t (cost=0.15..78.06 rows=2260 width=8)
(actual time=0.011..0.011 rows=0 loops=1)
Buffers: shared hit=1
Planning time: 0.088 ms
Execution time: 0.066 ms
(10 rows)
But from the perspective of how the code works, it really is one group: a
tuple sort was created, no tuples were inserted, then it was sorted and
yielded no tuples. So I'm not sure it's really incorrect...
And finally one question about regression tests: what's the purpose of the
changes in contrib/postgres_fdw/sql/postgres_fdw.sql ? I see no
IncrementalSort node in the output.
But there is an Incremental Sort node on the remote side.
Let's see what happens. The idea of the "CROSS JOIN, not pushed down" test
is that a cross join with ORDER BY and LIMIT is not beneficial to push down,
because the LIMIT is not pushed down and the remote side wouldn't be able to
use a top-N heapsort. But if the remote side has incremental sort, then it
can be used, and fetching the first 110 rows is cheap. Let's look at the
plan of the original "CROSS JOIN, not pushed down" test with incremental
sort.
# EXPLAIN (ANALYZE, VERBOSE) SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2
t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=160.32..161.31 rows=10 width=46) (actual time=1.918..1.921 rows=10 loops=1)
Output: t1.c3, t2.c3, t1.c1, t2.c1
-> Foreign Scan (cost=150.47..66711.06 rows=675684 width=46) (actual time=1.684..1.911 rows=110 loops=1)
Output: t1.c3, t2.c3, t1.c1, t2.c1
Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
Remote SQL: SELECT r1.c3, r1."C 1", r2.c3, r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
Planning time: 1.370 ms
Execution time: 2.068 ms
(8 rows)
And the "Remote SQL" has the following execution plan. This is the plan for
full execution, while the FDW fetches only the first 110 rows of it.
# EXPLAIN ANALYZE SELECT r1.c3, r1."C 1", r2.c3, r2."C 1" FROM ("S 1"."T 1"
r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST,
r2."C 1" ASC NULLS LAST;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Incremental Sort (cost=50.47..53097.38 rows=675684 width=34) (actual time=1.883..747.694 rows=675684 loops=1)
Sort Key: r1."C 1", r2."C 1"
Presorted Key: r1."C 1"
Sort Method: quicksort Memory: 114kB
Sort Groups: 822
-> Nested Loop (cost=0.28..8543.25 rows=675684 width=34) (actual time=0.027..144.070 rows=675684 loops=1)
-> Index Scan using t1_pkey on "T 1" r1 (cost=0.28..73.93 rows=822 width=17) (actual time=0.015..0.537 rows=822 loops=1)
-> Materialize (cost=0.00..25.33 rows=822 width=17) (actual time=0.000..0.053 rows=822 loops=822)
-> Seq Scan on "T 1" r2 (cost=0.00..21.22 rows=822 width=17) (actual time=0.007..0.257 rows=822 loops=1)
Planning time: 0.109 ms
Execution time: 785.400 ms
(11 rows)
Thus, with incremental sort this test doesn't do what it was designed to
do. Changing the ORDER BY columns from t1.c1, t2.c1 to t1.c3, t2.c3 fixes
this problem, because there is no index on c3. The query and result are
slightly different, but the test still serves its original design.
Please, find rebased patch attached.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
incremental-sort-11.patch (application/octet-stream)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index 1063d92..aa4d7c0
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1981,2019 ****
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ---------------------------------------------------------------------
Limit
! Output: t1.c1, t2.c1
-> Sort
! Output: t1.c1, t2.c1
! Sort Key: t1.c1, t2.c1
-> Nested Loop
! Output: t1.c1, t2.c1
-> Foreign Scan on public.ft1 t1
! Output: t1.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-> Materialize
! Output: t2.c1
-> Foreign Scan on public.ft2 t2
! Output: t2.c1
! Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! c1 | c1
! ----+-----
! 1 | 101
! 1 | 102
! 1 | 103
! 1 | 104
! 1 | 105
! 1 | 106
! 1 | 107
! 1 | 108
! 1 | 109
! 1 | 110
(10 rows)
-- different server, not pushed down. No result expected.
--- 1981,2019 ----
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! QUERY PLAN
! ------------------------------------------------------------------
Limit
! Output: t1.c3, t2.c3
-> Sort
! Output: t1.c3, t2.c3
! Sort Key: t1.c3, t2.c3
-> Nested Loop
! Output: t1.c3, t2.c3
-> Foreign Scan on public.ft1 t1
! Output: t1.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
-> Materialize
! Output: t2.c3
-> Foreign Scan on public.ft2 t2
! Output: t2.c3
! Remote SQL: SELECT c3 FROM "S 1"."T 1"
(15 rows)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! c3 | c3
! -------+-------
! 00001 | 00101
! 00001 | 00102
! 00001 | 00103
! 00001 | 00104
! 00001 | 00105
! 00001 | 00106
! 00001 | 00107
! 00001 | 00108
! 00001 | 00109
! 00001 | 00110
(10 rows)
-- different server, not pushed down. No result expected.
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index 0986957..cb46bfa
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 510,517 ****
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 510,517 ----
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index 3060597..d0e7c4d
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3553,3558 ****
--- 3553,3572 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+ <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of incremental sort
+ steps. The default is <literal>on</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
<term><varname>enable_indexscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 447f69d..a646d82
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual,
*** 80,85 ****
--- 80,87 ----
ExplainState *es);
static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 93,99 ****
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 95,101 ----
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 101,106 ****
--- 103,110 ----
static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 1011,1016 ****
--- 1015,1023 ----
case T_Sort:
pname = sname = "Sort";
break;
+ case T_IncrementalSort:
+ pname = sname = "Incremental Sort";
+ break;
case T_Group:
pname = sname = "Group";
break;
*************** ExplainNode(PlanState *planstate, List *
*** 1611,1616 ****
--- 1618,1629 ----
show_sort_keys(castNode(SortState, planstate), ancestors, es);
show_sort_info(castNode(SortState, planstate), es);
break;
+ case T_IncrementalSort:
+ show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ ancestors, es);
+ show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ es);
+ break;
case T_MergeAppend:
show_merge_append_keys(castNode(MergeAppendState, planstate),
ancestors, es);
*************** static void
*** 1936,1950 ****
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
* Likewise, for a MergeAppend node.
*/
static void
--- 1949,1986 ----
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
+ int skipCols;
+
+ if (IsA(plan, IncrementalSort))
+ skipCols = ((IncrementalSort *) plan)->skipCols;
+ else
+ skipCols = 0;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! plan->numCols, skipCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es)
+ {
+ IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+ show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ plan->sort.numCols, plan->skipCols,
+ plan->sort.sortColIdx,
+ plan->sort.sortOperators, plan->sort.collations,
+ plan->sort.nullsFirst,
+ ancestors, es);
+ }
+
+ /*
* Likewise, for a MergeAppend node.
*/
static void
*************** show_merge_append_keys(MergeAppendState
*** 1954,1960 ****
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
--- 1990,1996 ----
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
! plan->numCols, 0, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1978,1984 ****
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
--- 2014,2020 ----
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
*************** show_grouping_set_keys(PlanState *planst
*** 2047,2053 ****
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
--- 2083,2089 ----
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
! sortnode->numCols, 0, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 2104,2110 ****
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
--- 2140,2146 ----
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
! plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2117,2129 ****
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
--- 2153,2166 ----
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
! int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
+ List *resultPresorted = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2163,2171 ****
--- 2200,2212 ----
nullsFirst[keyno]);
/* Emit one property-list item per sort key */
result = lappend(result, pstrdup(sortkeybuf.data));
+ if (keyno < nPresortedKeys)
+ resultPresorted = lappend(resultPresorted, exprstr);
}
ExplainPropertyList(qlabel, result, es);
+ if (nPresortedKeys > 0)
+ ExplainPropertyList("Presorted Key", resultPresorted, es);
}
/*
*************** show_sort_info(SortState *sortstate, Exp
*** 2374,2379 ****
--- 2415,2509 ----
}
/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es)
+ {
+ if (es->analyze && incrsortstate->sort_Done &&
+ incrsortstate->tuplesortstate != NULL)
+ {
+ Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ TuplesortInstrumentation stats;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+
+ tuplesort_get_stats(state, &stats);
+ sortMethod = tuplesort_method_name(stats.sortMethod);
+ spaceType = tuplesort_space_type_name(stats.spaceType);
+ spaceUsed = stats.spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",
+ sortMethod, spaceType, spaceUsed);
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Groups: %ld\n",
+ incrsortstate->groupsCount);
+ }
+ else
+ {
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups",
+ incrsortstate->groupsCount, es);
+ }
+ }
+
+ if (incrsortstate->shared_info != NULL)
+ {
+ int n;
+ bool opened_group = false;
+
+ for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+ {
+ TuplesortInstrumentation *sinstrument;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+ int64 groupsCount;
+
+ sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+ groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+ if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+ continue; /* ignore any unfilled slots */
+ sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+ spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+ spaceUsed = sinstrument->spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str,
+ "Worker %d: Sort Method: %s %s: %ldkB Groups: %ld\n",
+ n, sortMethod, spaceType, spaceUsed, groupsCount);
+ }
+ else
+ {
+ if (!opened_group)
+ {
+ ExplainOpenGroup("Workers", "Workers", false, es);
+ opened_group = true;
+ }
+ ExplainOpenGroup("Worker", NULL, true, es);
+ ExplainPropertyInteger("Worker Number", n, es);
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups", groupsCount, es);
+ ExplainCloseGroup("Worker", NULL, true, es);
+ }
+ }
+ if (opened_group)
+ ExplainCloseGroup("Workers", "Workers", false, es);
+ }
+ }
+
+ /*
* Show information on hash buckets/batches.
*/
static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index cc09895..572aca0
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execExpr.
*** 24,31 ****
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
! nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
--- 24,31 ----
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
! nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
! nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index f1636a5..dd8cffe
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 31,36 ****
--- 31,37 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 253,258 ****
--- 254,263 ----
ExecReScanSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecReScanIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecReScanGroup((GroupState *) node);
break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 525,532 ****
--- 530,541 ----
case T_CteScan:
case T_Material:
case T_Sort:
+ /* these don't evaluate tlist */
return true;
+ case T_IncrementalSort:
+ return false;
+
case T_LockRows:
case T_Limit:
return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
new file mode 100644
index 53c5254..f3d6876
*** a/src/backend/executor/execParallel.c
--- b/src/backend/executor/execParallel.c
***************
*** 29,34 ****
--- 29,35 ----
#include "executor/nodeBitmapHeapscan.h"
#include "executor/nodeCustom.h"
#include "executor/nodeForeignscan.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeSeqscan.h"
*************** ExecParallelEstimate(PlanState *planstat
*** 263,268 ****
--- 264,273 ----
/* even when not parallel-aware */
ExecSortEstimate((SortState *) planstate, e->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+ break;
default:
break;
*************** ExecParallelInitializeDSM(PlanState *pla
*** 462,467 ****
--- 467,476 ----
/* even when not parallel-aware */
ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+ break;
default:
break;
*************** ExecParallelReInitializeDSM(PlanState *p
*** 876,881 ****
--- 885,894 ----
/* even when not parallel-aware */
ExecSortReInitializeDSM((SortState *) planstate, pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortReInitializeDSM((IncrementalSortState *) planstate, pcxt);
+ break;
default:
break;
*************** ExecParallelRetrieveInstrumentation(Plan
*** 934,939 ****
--- 947,954 ----
*/
if (IsA(planstate, SortState))
ExecSortRetrieveInstrumentation((SortState *) planstate);
+ else if (IsA(planstate, IncrementalSortState))
+ ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
return planstate_tree_walker(planstate, ExecParallelRetrieveInstrumentation,
instrumentation);
*************** ExecParallelInitializeWorker(PlanState *
*** 1164,1169 ****
--- 1179,1189 ----
/* even when not parallel-aware */
ExecSortInitializeWorker((SortState *) planstate, pwcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+ pwcxt);
+ break;
default:
break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index c1aa506..e4225df
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 88,93 ****
--- 88,94 ----
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 314,319 ****
--- 315,325 ----
estate, eflags);
break;
+ case T_IncrementalSort:
+ result = (PlanState *) ExecInitIncrementalSort(
+ (IncrementalSort *) node, estate, eflags);
+ break;
+
case T_Group:
result = (PlanState *) ExecInitGroup((Group *) node,
estate, eflags);
*************** ExecEndNode(PlanState *node)
*** 679,684 ****
--- 685,694 ----
ExecEndSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecEndIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecEndGroup((GroupState *) node);
break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index da6ef1a..ae9edb9
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 666,671 ****
--- 666,672 ----
sortnode->collations,
sortnode->nullsFirst,
work_mem,
+ false,
false);
}
*************** initialize_aggregate(AggState *aggstate,
*** 753,759 ****
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false);
}
/*
--- 754,760 ----
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
! work_mem, false, false);
}
/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...1a1e48f
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,649 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ * Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ * Incremental sort is a specially optimized kind of multikey sort used
+ * when the input is already presorted by a prefix of the required keys
+ * list. Thus, when it's required to sort by (key1, key2 ... keyN) and the
+ * input is already sorted by (key1, key2 ... keyM), M < N, we sort groups
+ * where the values of (key1, key2 ... keyM) are equal.
+ *
+ * Consider the following example. We have input tuples consisting of
+ * two integers (x, y), already presorted by x, while it's required to
+ * sort them by both x and y. Let the input tuples be the following.
+ *
+ * (1, 5)
+ * (1, 2)
+ * (2, 10)
+ * (2, 1)
+ * (2, 5)
+ * (3, 3)
+ * (3, 7)
+ *
+ * The incremental sort algorithm would sort the following groups, which
+ * have equal x, by y individually:
+ * (1, 5) (1, 2)
+ * (2, 10) (2, 1) (2, 5)
+ * (3, 3) (3, 7)
+ *
+ * After sorting these groups and concatenating them, we would get the
+ * following tuple set, which is sorted by both x and y.
+ *
+ * (1, 2)
+ * (1, 5)
+ * (2, 1)
+ * (2, 5)
+ * (2, 10)
+ * (3, 3)
+ * (3, 7)
+ *
+ * Incremental sort is faster than a full sort on large datasets. But
+ * the biggest benefit of incremental sort is for queries with
+ * LIMIT, because incremental sort can return the first tuples without
+ * reading the whole input dataset.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+
+ /*
+ * Prepare information for skipKeys comparison.
+ */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ int skipCols,
+ i;
+
+ Assert(IsA(plannode, IncrementalSort));
+ skipCols = plannode->skipCols;
+
+ node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+ for (i = 0; i < skipCols; i++)
+ {
+ Oid equalityOp, equalityFunc;
+ SkipKeyData *key;
+
+ key = &node->skipKeys[i];
+ key->attno = plannode->sort.sortColIdx[i];
+
+ equalityOp = get_equality_op_for_ordering_op(
+ plannode->sort.sortOperators[i], NULL);
+ if (!OidIsValid(equalityOp))
+ elog(ERROR, "missing equality operator for ordering operator %u",
+ plannode->sort.sortOperators[i]);
+
+ equalityFunc = get_opcode(equalityOp);
+ if (!OidIsValid(equalityFunc))
+ elog(ERROR, "missing function for operator %u", equalityOp);
+
+ /* Lookup the comparison function */
+ fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+ /* We can initialize the callinfo just once and re-use it */
+ InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ plannode->sort.collations[i], NULL, NULL);
+ key->fcinfo.argnull[0] = false;
+ key->fcinfo.argnull[1] = false;
+ }
+ }
+
+ /*
+ * Check if first "skipCols" sort values are equal.
+ */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ TupleTableSlot *b)
+ {
+ int n, i;
+
+ Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+ n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+ for (i = 0; i < n; i++)
+ {
+ Datum datumA, datumB, result;
+ bool isnullA, isnullB;
+ AttrNumber attno = node->skipKeys[i].attno;
+ SkipKeyData *key;
+
+ datumA = slot_getattr(a, attno, &isnullA);
+ datumB = slot_getattr(b, attno, &isnullB);
+
+ /* Special case for NULL-vs-NULL, else use standard comparison */
+ if (isnullA || isnullB)
+ {
+ if (isnullA == isnullB)
+ continue;
+ else
+ return false;
+ }
+
+ key = &node->skipKeys[i];
+
+ key->fcinfo.arg[0] = datumA;
+ key->fcinfo.arg[1] = datumB;
+
+ /* just for paranoia's sake, we reset isnull each time */
+ key->fcinfo.isnull = false;
+
+ result = FunctionCallInvoke(&key->fcinfo);
+
+ /* Check for null result, since caller is clearly not expecting one */
+ if (key->fcinfo.isnull)
+ elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+ if (!DatumGetBool(result))
+ return false;
+ }
+ return true;
+ }
+
+ /*
+ * Copying tuples to node->sampleSlot introduces some overhead. It's
+ * especially notable when groups contain only one or a few tuples. To
+ * cope with this problem, we don't copy a sample tuple until the group
+ * contains at least MIN_GROUP_SIZE tuples. This might reduce the
+ * efficiency of incremental sort, but it reduces the probability of regression.
+ */
+ #define MIN_GROUP_SIZE 32
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSort
+ *
+ * Assuming that the outer subtree returns tuples presorted by some prefix
+ * of the target sort columns, performs an incremental sort. It fetches
+ * groups of tuples where the prefix sort columns are equal and sorts them
+ * using tuplesort. This approach avoids sorting the whole
+ * dataset. Besides taking less memory and being faster, it allows us to
+ * start returning tuples before fetching the full dataset from the outer
+ * subtree.
+ *
+ * Conditions:
+ * -- none.
+ *
+ * Initial States:
+ * -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+ static TupleTableSlot *
+ ExecIncrementalSort(PlanState *pstate)
+ {
+ IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+ EState *estate;
+ ScanDirection dir;
+ Tuplesortstate *tuplesortstate;
+ TupleTableSlot *slot;
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ PlanState *outerNode;
+ TupleDesc tupDesc;
+ int64 nTuples = 0;
+
+ /*
+ * get state info from node
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "entering routine");
+
+ estate = node->ss.ps.state;
+ dir = estate->es_direction;
+ tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+ /*
+ * Return next tuple from sorted set if any.
+ */
+ if (node->sort_Done)
+ {
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ if (tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL) || node->finished)
+ return slot;
+ }
+
+ /*
+ * If first time through, read all tuples from outer plan and pass them to
+ * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
+ */
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "sorting subplan");
+
+ /*
+ * Want to scan subplan in the forward direction while creating the
+ * sorted data.
+ */
+ estate->es_direction = ForwardScanDirection;
+
+ /*
+ * Initialize tuplesort module.
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "calling tuplesort_begin");
+
+ outerNode = outerPlanState(node);
+ tupDesc = ExecGetResultType(outerNode);
+
+ if (node->tuplesortstate == NULL)
+ {
+ /*
+ * We are going to process the first group of presorted data.
+ * Initialize support structures for cmpSortSkipCols - already
+ * sorted columns.
+ */
+ prepareSkipCols(node);
+
+ /*
+ * Pass all the columns to tuplesort. We pass to tuplesort groups
+ * of at least MIN_GROUP_SIZE tuples. Thus, these groups don't
+ * necessarily have equal values of the first column. We are unlikely
+ * to have huge groups with incremental sort, so using
+ * abbreviated keys would likely be a waste of time.
+ */
+ tuplesortstate = tuplesort_begin_heap(
+ tupDesc,
+ plannode->sort.numCols,
+ plannode->sort.sortColIdx,
+ plannode->sort.sortOperators,
+ plannode->sort.collations,
+ plannode->sort.nullsFirst,
+ work_mem,
+ false,
+ true);
+ node->tuplesortstate = (void *) tuplesortstate;
+ }
+ else
+ {
+ /* Next group of presorted data */
+ tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ }
+ node->groupsCount++;
+
+ /* Calculate remaining bound for bounded sort */
+ if (node->bounded)
+ tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+ /* Put saved tuple to tuplesort if any */
+ if (!TupIsNull(node->sampleSlot))
+ {
+ tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+ ExecClearTuple(node->sampleSlot);
+ nTuples++;
+ }
+
+ /*
+ * Put the next group of tuples, in which the skipCols sort values are
+ * equal, into the tuplesort.
+ */
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+
+ /* Put next group of presorted data to the tuplesort */
+ if (nTuples < MIN_GROUP_SIZE)
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+
+ /* Save last tuple in minimal group */
+ if (nTuples == MIN_GROUP_SIZE - 1)
+ ExecCopySlot(node->sampleSlot, slot);
+ nTuples++;
+ }
+ else
+ {
+ /* Iterate while skip cols are the same as in saved tuple */
+ bool cmp;
+ cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+
+ if (cmp)
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ nTuples++;
+ }
+ else
+ {
+ ExecCopySlot(node->sampleSlot, slot);
+ break;
+ }
+ }
+ }
+
+ /*
+ * Complete the sort.
+ */
+ tuplesort_performsort(tuplesortstate);
+
+ /*
+ * restore to user specified direction
+ */
+ estate->es_direction = dir;
+
+ /*
+ * finally set the sorted flag to true
+ */
+ node->sort_Done = true;
+ node->bounded_Done = node->bounded;
+ if (node->shared_info && node->am_worker)
+ {
+ TuplesortInstrumentation *si;
+
+ Assert(IsParallelWorker());
+ Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+ si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+ tuplesort_get_stats(tuplesortstate, si);
+ node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+ node->groupsCount;
+ }
+
+ /*
+ * Adjust bound_Done with number of tuples we've actually sorted.
+ */
+ if (node->bounded)
+ {
+ if (node->finished)
+ node->bound_Done = node->bound;
+ else
+ node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ }
+
+ SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "retrieving tuple from tuplesort");
+
+ /*
+ * Get the first or next tuple from tuplesort. Returns NULL if no more
+ * tuples.
+ */
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+ return slot;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecInitIncrementalSort
+ *
+ * Creates the run-time state information for the sort node
+ * produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ IncrementalSortState *incrsortstate;
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "initializing sort node");
+
+ /*
+ * Incremental sort can't be used with EXEC_FLAG_REWIND,
+ * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only hold the
+ * current group in tuplesortstate.
+ */
+ Assert((eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) == 0);
+
+ /*
+ * create state structure
+ */
+ incrsortstate = makeNode(IncrementalSortState);
+ incrsortstate->ss.ps.plan = (Plan *) node;
+ incrsortstate->ss.ps.state = estate;
+ incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+ incrsortstate->bounded = false;
+ incrsortstate->sort_Done = false;
+ incrsortstate->finished = false;
+ incrsortstate->tuplesortstate = NULL;
+ incrsortstate->sampleSlot = NULL;
+ incrsortstate->bound_Done = 0;
+ incrsortstate->groupsCount = 0;
+ incrsortstate->skipKeys = NULL;
+
+ /*
+ * Miscellaneous initialization
+ *
+ * Sort nodes don't initialize their ExprContexts because they never call
+ * ExecQual or ExecProject.
+ */
+
+ /*
+ * tuple table initialization
+ *
+ * sort nodes only return scan tuples from their sorted relation.
+ */
+ ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+
+ /*
+ * initialize child nodes
+ *
+ * We shield the child node from the need to support REWIND, BACKWARD, or
+ * MARK/RESTORE.
+ */
+ eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+ outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+ /*
+ * initialize tuple type. no need to initialize projection info because
+ * this node doesn't do projections.
+ */
+ ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+ /* make standalone slot to store previous tuple from outer node */
+ incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+ ExecGetResultType(outerPlanState(incrsortstate)));
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "sort node initialized");
+
+ return incrsortstate;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "shutting down sort node");
+
+ /*
+ * clean out the tuple table
+ */
+ ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ /* must drop standalone tuple slot from outer node */
+ ExecDropSingleTupleTableSlot(node->sampleSlot);
+
+ /*
+ * Release tuplesort resources
+ */
+ if (node->tuplesortstate != NULL)
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+
+ /*
+ * shut down the subplan
+ */
+ ExecEndNode(outerPlanState(node));
+
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "sort node shutdown");
+ }
+
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ PlanState *outerPlan = outerPlanState(node);
+
+ /*
+ * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ * re-scan it at all.
+ */
+ if (!node->sort_Done)
+ return;
+
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+ /*
+ * If subnode is to be rescanned then we forget previous sort results; we
+ * have to re-read the subplan and re-sort. Also must re-sort if the
+ * bounded-sort parameters changed or we didn't select randomAccess.
+ *
+ * Otherwise we can just rewind and rescan the sorted output.
+ */
+ node->sort_Done = false;
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+ node->bound_Done = 0;
+
+ /*
+ * if chgParam of subnode is not null then plan will be re-scanned by
+ * first ExecProcNode.
+ */
+ if (outerPlan->chgParam == NULL)
+ ExecReScan(outerPlan);
+ }
+
+ /* ----------------------------------------------------------------
+ * Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortEstimate
+ *
+ * Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+ size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+ shm_toc_estimate_chunk(&pcxt->estimator, size);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortInitializeDSM
+ *
+ * Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + pcxt->nworkers * sizeof(IncrementalSortInfo);
+ node->shared_info = shm_toc_allocate(pcxt->toc, size);
+ /* ensure any unfilled slots will contain zeroes */
+ memset(node->shared_info, 0, size);
+ node->shared_info->num_workers = pcxt->nworkers;
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+ node->shared_info);
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ /* If there's any instrumentation space, clear it for next time */
+ if (node->shared_info != NULL)
+ {
+ memset(node->shared_info->sinfo, 0,
+ node->shared_info->num_workers * sizeof(IncrementalSortInfo));
+ }
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortInitializeWorker
+ *
+ * Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+ {
+ node->shared_info =
+ shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+ node->am_worker = true;
+ }
+
+ /* ----------------------------------------------------------------
+ * ExecIncrementalSortRetrieveInstrumentation
+ *
+ * Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+ void
+ ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+ {
+ Size size;
+ SharedIncrementalSortInfo *si;
+
+ if (node->shared_info == NULL)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+ si = palloc(size);
+ memcpy(si, node->shared_info, size);
+ node->shared_info = si;
+ }
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 73aa371..ef3587c
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(PlanState *pstate)
*** 93,99 ****
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
--- 93,100 ----
plannode->collations,
plannode->nullsFirst,
work_mem,
! node->randomAccess,
! false);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index aff9a62..56a5651
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 919,924 ****
--- 919,942 ----
/*
+ * CopySortFields
+ *
+ * This function copies the fields of the Sort node. It is used by
+ * all the copy functions for classes which inherit from Sort.
+ */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+ COPY_SCALAR_FIELD(numCols);
+ COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+
+ /*
* _copySort
*/
static Sort *
*************** _copySort(const Sort *from)
*** 929,941 ****
/*
* copy node superclass fields
*/
! CopyPlanFields((const Plan *) from, (Plan *) newnode);
! COPY_SCALAR_FIELD(numCols);
! COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
return newnode;
}
--- 947,975 ----
/*
* copy node superclass fields
*/
! CopySortFields(from, newnode);
! return newnode;
! }
!
!
! /*
! * _copyIncrementalSort
! */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! IncrementalSort *newnode = makeNode(IncrementalSort);
!
! /*
! * copy node superclass fields
! */
! CopySortFields((const Sort *) from, (Sort *) newnode);
!
! /*
! * copy remainder of node
! */
! COPY_SCALAR_FIELD(skipCols);
return newnode;
}
*************** copyObjectImpl(const void *from)
*** 4815,4820 ****
--- 4849,4857 ----
case T_Sort:
retval = _copySort(from);
break;
+ case T_IncrementalSort:
+ retval = _copyIncrementalSort(from);
+ break;
case T_Group:
retval = _copyGroup(from);
break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index c97ee24..6cb9300
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 869,880 ****
}
static void
! _outSort(StringInfo str, const Sort *node)
{
int i;
- WRITE_NODE_TYPE("SORT");
-
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
--- 869,878 ----
}
static void
! _outSortInfo(StringInfo str, const Sort *node)
{
int i;
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 897,902 ****
--- 895,918 ----
}
static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ WRITE_NODE_TYPE("SORT");
+
+ _outSortInfo(str, node);
+ }
+
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ WRITE_NODE_TYPE("INCREMENTALSORT");
+
+ _outSortInfo(str, (const Sort *) node);
+
+ WRITE_INT_FIELD(skipCols);
+ }
+
+ static void
_outUnique(StringInfo str, const Unique *node)
{
int i;
*************** outNode(StringInfo str, const void *obj)
*** 3737,3742 ****
--- 3753,3761 ----
case T_Sort:
_outSort(str, obj);
break;
+ case T_IncrementalSort:
+ _outIncrementalSort(str, obj);
+ break;
case T_Unique:
_outUnique(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index 7eb67fc..f2b0e75
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 2059,2070 ****
}
/*
! * _readSort
*/
! static Sort *
! _readSort(void)
{
! READ_LOCALS(Sort);
ReadCommonPlan(&local_node->plan);
--- 2059,2071 ----
}
/*
! * ReadCommonSort
! * Assign the basic stuff of all nodes that inherit from Sort
*/
! static void
! ReadCommonSort(Sort *local_node)
{
! READ_TEMP_LOCALS();
ReadCommonPlan(&local_node->plan);
*************** _readSort(void)
*** 2073,2078 ****
--- 2074,2105 ----
READ_OID_ARRAY(sortOperators, local_node->numCols);
READ_OID_ARRAY(collations, local_node->numCols);
READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+
+ /*
+ * _readSort
+ */
+ static Sort *
+ _readSort(void)
+ {
+ READ_LOCALS_NO_FIELDS(Sort);
+
+ ReadCommonSort(local_node);
+
+ READ_DONE();
+ }
+
+ /*
+ * _readIncrementalSort
+ */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ READ_LOCALS(IncrementalSort);
+
+ ReadCommonSort(&local_node->sort);
+
+ READ_INT_FIELD(skipCols);
READ_DONE();
}
*************** parseNodeString(void)
*** 2634,2639 ****
--- 2661,2668 ----
return_value = _readMaterial();
else if (MATCH("SORT", 4))
return_value = _readSort();
+ else if (MATCH("INCREMENTALSORT", 15))
+ return_value = _readIncrementalSort();
else if (MATCH("GROUP", 5))
return_value = _readGroup();
else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index 44f6b03..fbfef2b
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3461,3466 ****
--- 3461,3470 ----
ptype = "Sort";
subpath = ((SortPath *) path)->subpath;
break;
+ case T_IncrementalSortPath:
+ ptype = "IncrementalSort";
+ subpath = ((SortPath *) path)->subpath;
+ break;
case T_GroupPath:
ptype = "Group";
subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index d11bf19..2f7cf60
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
+ bool enable_incrementalsort = true;
bool enable_hashagg = true;
bool enable_nestloop = true;
bool enable_material = true;
*************** cost_recursive_union(Path *runion, Path
*** 1601,1606 ****
--- 1602,1614 ----
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
+ * The sort can be either a full sort of the relation or an incremental sort
+ * when the data is already presorted by some of the required pathkeys. In the
+ * latter case we estimate the number of groups the source data is divided
+ * into by the presorted pathkeys, and then estimate the cost of sorting each
+ * individual group, assuming the data is divided among the groups uniformly.
+ * Also, if a LIMIT is specified, we only have to pull from the source and
+ * sort some of the groups.
+ *
* If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path
*** 1627,1633 ****
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'input_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
--- 1635,1643 ----
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
! * 'presorted_keys' is a number of pathkeys already presorted in given path
! * 'input_startup_cost' is the startup cost for reading the input data
! * 'input_total_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path
*** 1643,1661 ****
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_cost;
! Cost run_cost = 0;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
path->rows = tuples;
--- 1653,1680 ----
*/
void
cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples)
{
! Cost startup_cost = input_startup_cost;
! Cost run_cost = 0,
! rest_cost,
! group_cost,
! input_run_cost = input_total_cost - input_startup_cost;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
+ double num_groups,
+ group_input_bytes,
+ group_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
+ if (!enable_incrementalsort)
+ presorted_keys = 0;
path->rows = tuples;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1681,1693 ****
output_bytes = input_bytes;
}
! if (output_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(input_bytes / BLCKSZ);
! double nruns = input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
--- 1700,1749 ----
output_bytes = input_bytes;
}
! /*
! * Estimate the number of groups the dataset is divided into by the
! * presorted keys.
! */
! if (presorted_keys > 0)
! {
! List *presortedExprs = NIL;
! ListCell *l;
! int i = 0;
!
! /* Extract presorted keys as list of expressions */
! foreach(l, pathkeys)
! {
! PathKey *key = (PathKey *)lfirst(l);
! EquivalenceMember *member = (EquivalenceMember *)
! linitial(key->pk_eclass->ec_members);
!
! presortedExprs = lappend(presortedExprs, member->em_expr);
!
! i++;
! if (i >= presorted_keys)
! break;
! }
!
! /* Estimate number of groups with equal presorted keys */
! num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! }
! else
! {
! num_groups = 1.0;
! }
!
! /*
! * Estimate the average cost of sorting one group in which the presorted
! * keys are equal.
! */
! group_input_bytes = input_bytes / num_groups;
! group_tuples = tuples / num_groups;
! if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
! double npages = ceil(group_input_bytes / BLCKSZ);
! double nruns = group_input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1697,1703 ****
*
* Assume about N log2 N comparisons
*/
! startup_cost += comparison_cost * tuples * LOG2(tuples);
/* Disk costs */
--- 1753,1759 ----
*
* Assume about N log2 N comparisons
*/
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
/* Disk costs */
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1708,1717 ****
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! startup_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1764,1773 ----
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! group_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
! else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1719,1732 ****
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
}
else
{
! /* We'll use plain quicksort on all the input tuples */
! startup_cost += comparison_cost * tuples * LOG2(tuples);
}
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
--- 1775,1807 ----
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
! group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
}
else
{
! /*
! * We'll use plain quicksort on all the input tuples. If we expect
! * fewer than two tuples per sort group, assume the logarithmic part
! * of the estimate is 1.
! */
! if (group_tuples >= 2.0)
! group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
! else
! group_cost = comparison_cost * group_tuples;
}
+ /* Add per group cost of fetching tuples from input */
+ group_cost += input_run_cost / num_groups;
+
+ /*
+ * We have to sort the first group before the node can start returning
+ * tuples. Sorting the remaining groups is required to return all the
+ * other tuples.
+ */
+ startup_cost += group_cost;
+ rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ if (rest_cost > 0.0)
+ run_cost += rest_cost;
+
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1737,1742 ****
--- 1812,1830 ----
*/
run_cost += cpu_operator_cost * tuples;
+ /* Extra costs of incremental sort */
+ if (presorted_keys > 0)
+ {
+ /*
+ * In the incremental sort case we also have to account for the cost of
+ * detecting sort groups. This amounts to an extra copy and comparison
+ * per tuple.
+ */
+ run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+ /* Cost of per group tuplesort reset */
+ run_cost += 2.0 * cpu_tuple_cost * num_groups;
+ }
+
path->startup_cost = startup_cost;
path->total_cost = startup_cost + run_cost;
}
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2544,2549 ****
--- 2632,2639 ----
cost_sort(&sort_path,
root,
outersortkeys,
+ pathkeys_common(outer_path->pathkeys, outersortkeys),
+ outer_path->startup_cost,
outer_path->total_cost,
outer_path_rows,
outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2570,2575 ****
--- 2660,2667 ----
cost_sort(&sort_path,
root,
innersortkeys,
+ pathkeys_common(inner_path->pathkeys, innersortkeys),
+ inner_path->startup_cost,
inner_path->total_cost,
inner_path_rows,
inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index c6870d3..b97f22a
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
#include "nodes/nodeFuncs.h"
#include "nodes/plannodes.h"
#include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/tlist.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
return PATHKEYS_EQUAL;
}
+
+ /*
+ * pathkeys_common
+ * Returns length of longest common prefix of keys1 and keys2.
+ */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ int n;
+ ListCell *key1,
+ *key2;
+ n = 0;
+
+ forboth(key1, keys1, key2, keys2)
+ {
+ PathKey *pathkey1 = (PathKey *) lfirst(key1);
+ PathKey *pathkey2 = (PathKey *) lfirst(key2);
+
+ if (pathkey1 != pathkey2)
+ return n;
+ n++;
+ }
+
+ return n;
+ }
+
+
/*
* pathkeys_contained_in
* Common special case of compare_pathkeys: we just want to know
*************** right_merge_direction(PlannerInfo *root,
*** 1488,1513 ****
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
! * no good to order by just the first key(s) of the requested ordering.
! * So the result is always either 0 or list_length(root->query_pathkeys).
*/
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
{
! if (root->query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
{
! /* It's useful ... or at least the first N keys are */
! return list_length(root->query_pathkeys);
}
-
- return 0; /* path ordering not useful */
}
/*
--- 1517,1558 ----
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
! * Returns the number of pathkeys that match the given argument. The
! * remaining pathkeys can be satisfied by incremental sort.
*/
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
{
! int n_common_pathkeys;
!
! if (query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
! n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
!
! if (enable_incrementalsort)
{
! /*
! * Return the number of pathkeys in common, or 0 if there are none.
! * Any common prefix of the pathkeys is useful for ordering because
! * incremental sort can handle the rest.
! */
! return n_common_pathkeys;
! }
! else
! {
! /*
! * When incremental sort is disabled, pathkeys are useful only when
! * they contain all the query pathkeys.
! */
! if (n_common_pathkeys == list_length(query_pathkeys))
! return n_common_pathkeys;
! else
! return 0;
}
}
/*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1523,1529 ****
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
--- 1568,1574 ----
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index d445477..b080fa6
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 235,241 ****
Plan *lefttree, Plan *righttree,
JoinType jointype, bool inner_unique,
bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 235,241 ----
Plan *lefttree, Plan *righttree,
JoinType jointype, bool inner_unique,
bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static EquivalenceMember *find_ec_member
*** 251,260 ****
TargetEntry *tle,
Relids relids);
static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! Relids relids);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 251,261 ----
TargetEntry *tle,
Relids relids);
static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! Relids relids, int skipCols);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 436,441 ****
--- 437,443 ----
(GatherPath *) best_path);
break;
case T_Sort:
+ case T_IncrementalSort:
plan = (Plan *) create_sort_plan(root,
(SortPath *) best_path,
flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1120,1125 ****
--- 1122,1128 ----
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int n_common_pathkeys;
/* Build the child plan */
/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1154,1162 ****
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
--- 1157,1167 ----
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
+ n_common_pathkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1506,1511 ****
--- 1511,1517 ----
Plan *subplan;
List *pathkeys = best_path->path.pathkeys;
List *tlist = build_path_tlist(root, &best_path->path);
+ int n_common_pathkeys;
/* As with Gather, it's best to project away columns in the workers. */
subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1535,1546 ****
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
--- 1541,1556 ----
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
! if (n_common_pathkeys < list_length(pathkeys))
! {
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ n_common_pathkeys,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
+ }
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1653,1658 ****
--- 1663,1669 ----
{
Sort *plan;
Plan *subplan;
+ int n_common_pathkeys;
/*
* We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1662,1668 ****
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
copy_generic_path_info(&plan->plan, (Path *) best_path);
--- 1673,1685 ----
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
! if (IsA(best_path, IncrementalSortPath))
! n_common_pathkeys = ((IncrementalSortPath *) best_path)->skipCols;
! else
! n_common_pathkeys = 0;
!
! plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! NULL, n_common_pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1906,1912 ****
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
! subplan);
}
if (!rollup->is_hashed)
--- 1923,1930 ----
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
! subplan,
! 0);
}
if (!rollup->is_hashed)
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3846,3855 ****
*/
if (best_path->outersortkeys)
{
Relids outer_relids = outer_path->parent->relids;
! Sort *sort = make_sort_from_pathkeys(outer_plan,
! best_path->outersortkeys,
! outer_relids);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
--- 3864,3878 ----
*/
if (best_path->outersortkeys)
{
+ Sort *sort;
+ int n_common_pathkeys;
Relids outer_relids = outer_path->parent->relids;
!
! n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! best_path->jpath.outerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! outer_relids, n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3860,3869 ****
if (best_path->innersortkeys)
{
Relids inner_relids = inner_path->parent->relids;
! Sort *sort = make_sort_from_pathkeys(inner_plan,
! best_path->innersortkeys,
! inner_relids);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
--- 3883,3897 ----
if (best_path->innersortkeys)
{
+ Sort *sort;
+ int n_common_pathkeys;
Relids inner_relids = inner_path->parent->relids;
!
! n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! best_path->jpath.innerjoinpath->pathkeys);
!
! sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! inner_relids, n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4914,4921 ****
{
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
! cost_sort(&sort_path, root, NIL,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
--- 4942,4954 ----
{
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
+ int skip_cols = 0;
! if (IsA(plan, IncrementalSort))
! skip_cols = ((IncrementalSort *) plan)->skipCols;
!
! cost_sort(&sort_path, root, NIL, skip_cols,
! lefttree->startup_cost,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5504,5516 ****
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node = makeNode(Sort);
! Plan *plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
--- 5537,5567 ----
* nullsFirst arrays already.
*/
static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
! Sort *node;
! Plan *plan;
!
! /* Always use regular sort node when enable_incrementalsort = false */
! if (!enable_incrementalsort)
! skipCols = 0;
+ if (skipCols == 0)
+ {
+ node = makeNode(Sort);
+ }
+ else
+ {
+ IncrementalSort *incrementalSort;
+
+ incrementalSort = makeNode(IncrementalSort);
+ node = &incrementalSort->sort;
+ incrementalSort->skipCols = skipCols;
+ }
+
+ plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass
*** 5843,5851 ****
* 'lefttree' is the node which yields input tuples
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
* 'relids' is the set of relations required by prepare_sort_from_pathkeys()
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
{
int numsortkeys;
AttrNumber *sortColIdx;
--- 5894,5904 ----
* 'lefttree' is the node which yields input tuples
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
* 'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ * 'skipCols' is the number of presorted columns in input tuples
*/
static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! Relids relids, int skipCols)
{
int numsortkeys;
AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree,
*** 5865,5871 ****
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5918,5924 ----
&nullsFirst);
/* Now build the Sort node */
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5908,5914 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 5961,5967 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, 0,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** make_sort_from_sortclauses(List *sortcls
*** 5929,5935 ****
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
--- 5982,5989 ----
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
! Plan *lefttree,
! int skipCols)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5962,5968 ****
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
}
--- 6016,6022 ----
numsortkeys++;
}
! return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
*************** is_projection_capable_plan(Plan *plan)
*** 6619,6624 ****
--- 6673,6679 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index 889e8af..49af1f1
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
#include "parser/parse_clause.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
#include "utils/syscache.h"
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index ef2eaea..5b41aaf
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3846,3859 ****
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_partial_path || is_sorted)
{
/* Sort the cheapest partial path, if it isn't already */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3846,3859 ----
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_partial_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest partial path, if it isn't already */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3926,3939 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->group_pathkeys,
! path->pathkeys);
! if (path == cheapest_path || is_sorted)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (!is_sorted)
path = (Path *) create_sort_path(root,
grouped_rel,
path,
--- 3926,3939 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(
! root->group_pathkeys, path->pathkeys);
! if (path == cheapest_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest-total path if it isn't already sorted */
! if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
*************** create_ordered_paths(PlannerInfo *root,
*** 5000,5012 ****
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! bool is_sorted;
! is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || is_sorted)
{
! if (!is_sorted)
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
--- 5000,5012 ----
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
! int n_useful_pathkeys;
! n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! path->pathkeys);
! if (path == cheapest_input_path || n_useful_pathkeys > 0)
{
! if (n_useful_pathkeys < list_length(root->sort_pathkeys))
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid
*** 6136,6143 ****
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL,
! seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
--- 6136,6144 ----
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! cost_sort(&seqScanAndSortPath, root, NIL, 0,
! seqScanPath->startup_cost, seqScanPath->total_cost,
! rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index b5c4124..1ff9d42
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 642,647 ****
--- 642,648 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 2e3abee..0ee6812
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2782,2787 ****
--- 2782,2788 ----
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index f620243..c83161f
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 988,994 ****
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 988,995 ----
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! cost_sort(&sorted_p, root, NIL, 0,
! sorted_p.startup_cost, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index bc0841b..d973f8b
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 103,109 ****
}
/*
! * compare_path_fractional_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
--- 103,109 ----
}
/*
! * compare_fractional_path_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1304,1315 ****
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1304,1316 ----
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ int n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1323,1328 ****
--- 1324,1331 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->parent->tuples,
subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1570,1576 ****
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
--- 1573,1580 ----
/*
* Estimate cost for sort+unique implementation
*/
! cost_sort(&sort_path, root, NIL, 0,
! subpath->startup_cost,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
*************** create_gather_merge_path(PlannerInfo *ro
*** 1663,1668 ****
--- 1667,1673 ----
GatherMergePath *pathnode = makeNode(GatherMergePath);
Cost input_startup_cost = 0;
Cost input_total_cost = 0;
+ int n_common_pathkeys;
Assert(subpath->parallel_safe);
Assert(pathkeys);
*************** create_gather_merge_path(PlannerInfo *ro
*** 1679,1685 ****
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
! if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
--- 1684,1692 ----
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
! n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
!
! if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
*************** create_gather_merge_path(PlannerInfo *ro
*** 1693,1698 ****
--- 1700,1707 ----
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2549,2557 ****
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode = makeNode(SortPath);
- pathnode->path.pathtype = T_Sort;
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
--- 2558,2588 ----
List *pathkeys,
double limit_tuples)
{
! SortPath *pathnode;
! int n_common_pathkeys;
!
! if (enable_incrementalsort)
! n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! else
! n_common_pathkeys = 0;
!
! if (n_common_pathkeys == 0)
! {
! pathnode = makeNode(SortPath);
! pathnode->path.pathtype = T_Sort;
! }
! else
! {
! IncrementalSortPath *incpathnode;
!
! incpathnode = makeNode(IncrementalSortPath);
! pathnode = &incpathnode->spath;
! pathnode->path.pathtype = T_IncrementalSort;
! incpathnode->skipCols = n_common_pathkeys;
! }
!
! Assert(n_common_pathkeys < list_length(pathkeys));
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2565,2571 ****
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root, pathkeys,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
--- 2596,2604 ----
pathnode->subpath = subpath;
! cost_sort(&pathnode->path, root,
! pathkeys, n_common_pathkeys,
! subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2877,2883 ****
else
{
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL,
0.0,
subpath->rows,
subpath->pathtarget->width,
--- 2910,2917 ----
else
{
/* Account for cost of sort, but don't charge input cost again */
! cost_sort(&sort_path, root, NIL, 0,
! 0.0,
0.0,
subpath->rows,
subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index 1e323d9..8f01f05
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 291,297 ****
qstate->sortCollations,
qstate->sortNullsFirsts,
work_mem,
! qstate->rescan_needed);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
--- 291,298 ----
qstate->sortCollations,
qstate->sortNullsFirsts,
work_mem,
! qstate->rescan_needed,
! false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index ea95b80..abf6c38
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3715,3720 ****
--- 3715,3756 ----
}
/*
+ * estimate_pathkeys_groups - Estimate the numbers of groups that pathkeys
+ * divide the dataset into.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * that the first i pathkeys divide the dataset into.  This is actually a
+ * convenience wrapper over estimate_num_groups().
+ */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ ListCell *l;
+ List *groupExprs = NIL;
+ double *result;
+ int i;
+
+ /*
+ * Get number of groups for each prefix of pathkeys.
+ */
+ i = 0;
+ result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ groupExprs = lappend(groupExprs, member->em_expr);
+
+ result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ i++;
+ }
+
+ return result;
+ }
+
+ /*
* Estimate hash bucket statistics when the specified expression is used
* as a hash key for the given number of buckets.
*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 6dcd738..192d3c8
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 858,863 ****
--- 858,872 ----
NULL, NULL, NULL
},
{
+ {"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of incremental sort steps."),
+ NULL
+ },
+ &enable_incrementalsort,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of hashed aggregation plans."),
NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 3c23ac7..118edb9
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 231,236 ****
--- 231,243 ----
int64 allowedMem; /* total memory allowed, in bytes */
int maxTapes; /* number of tapes (Knuth's T) */
int tapeRange; /* maxTapes-1 (Knuth's P) */
+ int64 maxSpace; /* maximum amount of space occupied among sorts
+ of groups, either in-memory or on-disk */
+ bool maxSpaceOnDisk; /* true when maxSpace is a value for on-disk
+ space, false when it's a value for in-memory
+ space */
+ TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ MemoryContext maincontext;
MemoryContext sortcontext; /* memory context holding most sort data */
MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 573,578 ****
--- 580,588 ----
static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+
/*
* Special versions of qsort just for SortTuple objects. qsort_tuple() sorts
*************** static Tuplesortstate *
*** 607,625 ****
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Create a working memory context for this sort operation. All data
! * needed by the sort will live inside this context.
*/
! sortcontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
--- 617,646 ----
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
+ MemoryContext maincontext;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
! * Memory context surviving tuplesort_reset.  This memory context holds
! * data which is useful to keep while sorting multiple similar batches.
*/
! maincontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
/*
+ * Create a working memory context for one sort operation. The content of
+ * this context is deleted by tuplesort_reset.
+ */
+ sortcontext = AllocSetContextCreate(maincontext,
+ "TupleSort sort",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+
+ /*
* Caller tuple (e.g. IndexTuple) memory context.
*
* A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 636,642 ****
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(sortcontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
--- 657,663 ----
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
! oldcontext = MemoryContextSwitchTo(maincontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
*************** tuplesort_begin_common(int workMem, bool
*** 654,659 ****
--- 675,681 ----
state->availMem = state->allowedMem;
state->sortcontext = sortcontext;
state->tuplecontext = tuplecontext;
+ state->maincontext = maincontext;
state->tapeset = NULL;
state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 694,706 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
AssertArg(nkeys > 0);
--- 716,729 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
AssertArg(nkeys > 0);
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 742,748 ****
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0);
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
--- 765,771 ----
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
! sortKey->abbreviate = (i == 0) && !skipAbbrev;
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 773,779 ****
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 796,802 ----
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 864,870 ****
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 887,893 ----
MemoryContext oldcontext;
int i;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 939,945 ****
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 962,968 ----
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 981,987 ****
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
--- 1004,1010 ----
int16 typlen;
bool typbyval;
! oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1092,1107 ****
}
/*
! * tuplesort_end
! *
! * Release resources and clean up.
*
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
*/
! void
! tuplesort_end(Tuplesortstate *state)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1115,1126 ----
}
/*
! * tuplesort_free
*
! * Internal routine for freeing the resources of a tuplesort.
*/
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1160,1166 ****
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! MemoryContextDelete(state->sortcontext);
}
/*
--- 1179,1276 ----
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
! if (delete)
! {
! MemoryContextDelete(state->maincontext);
! }
! else
! {
! MemoryContextResetOnly(state->sortcontext);
! MemoryContextResetOnly(state->tuplecontext);
! }
! }
!
! /*
! * tuplesort_end
! *
! * Release resources and clean up.
! *
! * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
! * pointing to garbage. Be careful not to attempt to use or free such
! * pointers afterwards!
! */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! tuplesort_free(state, true);
! }
!
! /*
! * tuplesort_updatemax
! *
! * Update maximum resource usage statistics.
! */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! int64 spaceUsed;
! bool spaceUsedOnDisk;
!
! /*
! * Note: it might seem we should provide both memory and disk usage for a
! * disk-based sort. However, the current code doesn't track memory space
! * accurately once we have begun to return tuples to the caller (since we
! * don't account for pfree's the caller is expected to do), so we cannot
! * rely on availMem in a disk sort. This does not seem worth the overhead
! * to fix. Is it worth creating an API for the memory context code to
! * tell us how much is actually used in sortcontext?
! */
! if (state->tapeset)
! {
! spaceUsedOnDisk = true;
! spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! }
! else
! {
! spaceUsedOnDisk = false;
! spaceUsed = state->allowedMem - state->availMem;
! }
!
! if (spaceUsed > state->maxSpace)
! {
! state->maxSpace = spaceUsed;
! state->maxSpaceOnDisk = spaceUsedOnDisk;
! state->maxSpaceStatus = state->status;
! }
! }
!
! /*
! * tuplesort_reset
! *
! * Reset the tuplesort. Reset all the data in the tuplesort, but leave the
! * meta-information in. After tuplesort_reset, tuplesort is ready to start
! * a new sort. This allows us to avoid recreating the tuple sort (and thus
! * save resources) when sorting multiple small batches.
! */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! tuplesort_updatemax(state);
! tuplesort_free(state, false);
! state->status = TSS_INITIAL;
! state->memtupcount = 0;
! state->boundUsed = false;
! state->tapeset = NULL;
! state->currentRun = 0;
! state->result_tape = -1;
! state->bounded = false;
! state->availMem = state->allowedMem;
! state->lastReturnedTuple = NULL;
! state->slabAllocatorUsed = false;
! state->slabMemoryBegin = NULL;
! state->slabMemoryEnd = NULL;
! state->slabFreeHead = NULL;
! USEMEM(state, GetMemoryChunkSpace(state->memtuples));
}
/*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 2949,2966 ****
* to fix. Is it worth creating an API for the memory context code to
* tell us how much is actually used in sortcontext?
*/
! if (state->tapeset)
! {
stats->spaceType = SORT_SPACE_TYPE_DISK;
- stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- }
else
- {
stats->spaceType = SORT_SPACE_TYPE_MEMORY;
! stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! }
! switch (state->status)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
--- 3059,3073 ----
* to fix. Is it worth creating an API for the memory context code to
* tell us how much is actually used in sortcontext?
*/
! tuplesort_updatemax(state);
!
! if (state->maxSpaceOnDisk)
stats->spaceType = SORT_SPACE_TYPE_DISK;
else
stats->spaceType = SORT_SPACE_TYPE_MEMORY;
! stats->spaceUsed = (state->maxSpace + 1023) / 1024;
! switch (state->maxSpaceStatus)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...b2e4e50
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,31 ----
+ /*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+
+ #include "access/parallel.h"
+ #include "nodes/execnodes.h"
+
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+ /* parallel instrumentation support */
+ extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+ extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+ #endif /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index e05bc04..ff019c5
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1743,1748 ****
--- 1743,1762 ----
Tuplestorestate *tuplestorestate;
} MaterialState;
+
+ /* ----------------
+ * When sorting by multiple keys, the input dataset might already be
+ * presorted by some prefix of those keys. We call these "skip keys".
+ * SkipKeyData represents information about one such key.
+ * ----------------
+ */
+ typedef struct SkipKeyData
+ {
+ FmgrInfo flinfo; /* comparison function info */
+ FunctionCallInfoData fcinfo; /* comparison function call info */
+ OffsetNumber attno; /* attribute number in tuple */
+ } SkipKeyData;
+
/* ----------------
* Shared memory container for per-worker sort information
* ----------------
*************** typedef struct SortState
*** 1771,1776 ****
--- 1785,1828 ----
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
+ /* ----------------
+ * Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+ typedef struct IncrementalSortInfo
+ {
+ TuplesortInstrumentation sinstrument;
+ int64 groupsCount;
+ } IncrementalSortInfo;
+
+ typedef struct SharedIncrementalSortInfo
+ {
+ int num_workers;
+ IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+ } SharedIncrementalSortInfo;
+
+ /* ----------------
+ * IncrementalSortState information
+ * ----------------
+ */
+ typedef struct IncrementalSortState
+ {
+ ScanState ss; /* its first field is NodeTag */
+ bool bounded; /* is the result set bounded? */
+ int64 bound; /* if bounded, how many tuples are needed */
+ bool sort_Done; /* sort completed yet? */
+ bool finished; /* fetching tuples from outer node
+ is finished? */
+ bool bounded_Done; /* value of bounded we did the sort with */
+ int64 bound_Done; /* value of bound we did the sort with */
+ void *tuplesortstate; /* private state of tuplesort.c */
+ SkipKeyData *skipKeys; /* keys the dataset is presorted by */
+ int64 groupsCount; /* number of groups with equal skip keys */
+ TupleTableSlot *sampleSlot; /* slot for sample tuple of sort group */
+ bool am_worker; /* are we a worker? */
+ SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+ } IncrementalSortState;
+
/* ---------------------
* GroupState information
* ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index c5b5115..9ae5d57
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 73,78 ****
--- 73,79 ----
T_HashJoin,
T_Material,
T_Sort,
+ T_IncrementalSort,
T_Group,
T_Agg,
T_WindowAgg,
*************** typedef enum NodeTag
*** 125,130 ****
--- 126,132 ----
T_HashJoinState,
T_MaterialState,
T_SortState,
+ T_IncrementalSortState,
T_GroupState,
T_AggState,
T_WindowAggState,
*************** typedef enum NodeTag
*** 240,245 ****
--- 242,248 ----
T_ProjectionPath,
T_ProjectSetPath,
T_SortPath,
+ T_IncrementalSortPath,
T_GroupPath,
T_UpperUniquePath,
T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index 9b38d44..0694fb2
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 749,754 ****
--- 749,765 ----
bool *nullsFirst; /* NULLS FIRST/LAST directions */
} Sort;
+
+ /* ----------------
+ * incremental sort node
+ * ----------------
+ */
+ typedef struct IncrementalSort
+ {
+ Sort sort;
+ int skipCols; /* number of presorted columns */
+ } IncrementalSort;
+
/* ---------------
* group node -
* Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index 51df8e9..a979461
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1507,1512 ****
--- 1507,1522 ----
} SortPath;
/*
+ * IncrementalSortPath
+ */
+ typedef struct IncrementalSortPath
+ {
+ SortPath spath;
+ int skipCols;
+ } IncrementalSortPath;
+
+
+ /*
* GroupPath represents grouping (of presorted input)
*
* groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 6c2317d..138d951
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
extern bool enable_bitmapscan;
extern bool enable_tidscan;
extern bool enable_sort;
+ extern bool enable_incrementalsort;
extern bool enable_hashagg;
extern bool enable_nestloop;
extern bool enable_material;
*************** extern void cost_namedtuplestorescan(Pat
*** 103,110 ****
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, Cost input_cost, double tuples, int width,
! Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
--- 104,112 ----
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
! List *pathkeys, int presorted_keys,
! Cost input_startup_cost, Cost input_total_cost,
! double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index ea886b6..b4370e2
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 188,193 ****
--- 188,194 ----
extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
Relids required_outer,
CostSelector cost_criterion,
*************** extern List *select_outer_pathkeys_for_m
*** 226,231 ****
--- 227,233 ----
extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
List *mergeclauses,
List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
extern List *truncate_useless_pathkeys(PlannerInfo *root,
RelOptInfo *rel,
List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 199a631..41b7196
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 206,211 ****
--- 206,214 ----
extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows, List **pgset);
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ double tuples);
+
extern void estimate_hash_bucket_stats(PlannerInfo *root,
Node *hashkey, double nbuckets,
Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index b6b8c8e..938d329
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 90,96 ****
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
--- 90,97 ----
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
! int workMem, bool randomAccess,
! bool skipAbbrev);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 134,139 ****
--- 135,142 ----
extern void tuplesort_end(Tuplesortstate *state);
+ extern void tuplesort_reset(Tuplesortstate *state);
+
extern void tuplesort_get_stats(Tuplesortstate *state,
TuplesortInstrumentation *stats);
extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort
*** 19,27 ****
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Sort
Sort Key: id, data
! -> Seq Scan on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
--- 19,28 ----
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
! Incremental Sort
Sort Key: id, data
! Presorted Key: id
! -> Index Scan using test_dc_pkey on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index fac7b62..18dc749
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE: drop cascades to table matest1
*** 1515,1520 ****
--- 1515,1521 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1655,1663 ****
--- 1656,1700 ----
{3,7,8,10,13,13,16,18,19,22}
(3 rows)
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ QUERY PLAN
+ -------------------------------------------------------------------------
+ Merge Append
+ Sort Key: tenk1.thousand, tenk1.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1
+ -> Incremental Sort
+ Sort Key: tenk1_1.thousand, tenk1_1.thousand
+ Presorted Key: tenk1_1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ QUERY PLAN
+ -------------------------------------------------------------
+ Merge Append
+ Sort Key: a.thousand, a.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 a
+ -> Incremental Sort
+ Sort Key: b.unique2, b.unique2
+ Presorted Key: b.unique2
+ -> Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index cd1f7f3..5acfbbb
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select name, setting from pg_settings wh
*** 76,81 ****
--- 76,82 ----
enable_gathermerge | on
enable_hashagg | on
enable_hashjoin | on
+ enable_incrementalsort | on
enable_indexonlyscan | on
enable_indexscan | on
enable_material | on
*************** select name, setting from pg_settings wh
*** 85,91 ****
enable_seqscan | on
enable_sort | on
enable_tidscan | on
! (13 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
--- 86,92 ----
enable_seqscan | on
enable_sort | on
enable_tidscan | on
! (14 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index c71febf..5e077a9
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 544,549 ****
--- 544,550 ----
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
*************** SELECT
*** 605,613 ****
--- 606,631 ----
ORDER BY f.i LIMIT 10)
FROM generate_series(1, 3) g(i);
+ set enable_incrementalsort = on;
+
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+
+ explain (costs off)
+ SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+ reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Wed, Nov 22, 2017 at 1:22 PM, Antonin Houska <ah@cybertec.at> wrote:
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Antonin Houska <ah@cybertec.at> wrote:
* ExecIncrementalSort()
** if (node->tuplesortstate == NULL)
If both branches contain the expression
node->groupsCount++;
I suggest moving it outside the "if" construct.
Done.
One more comment on this: I wonder if the field isn't incremented too
early. It seems to me that the value can end up non-zero if the input set
happens to be empty (not sure if it can happen in practice).
That happens in practice. On an empty input set, incremental sort counts exactly one group.
# create table t (x int, y int);
CREATE TABLE
# create index t_x_idx on t (x);
CREATE INDEX
# set enable_seqscan = off;
SET
# explain (analyze, buffers) select * from t order by x, y;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Incremental Sort (cost=0.74..161.14 rows=2260 width=8) (actual time=0.024..0.024 rows=0 loops=1)
Sort Key: x, y
Presorted Key: x
Sort Method: quicksort Memory: 25kB
Sort Groups: 1
Buffers: shared hit=1
-> Index Scan using t_x_idx on t (cost=0.15..78.06 rows=2260 width=8) (actual time=0.011..0.011 rows=0 loops=1)
Buffers: shared hit=1
Planning time: 0.088 ms
Execution time: 0.066 ms
(10 rows)
But from the perspective of how the code works, it really is 1 group. A tuple sort was created, no tuples were inserted into it, then it was sorted and no tuples came out. So I'm not sure it's really incorrect...
I expected the number of groups that actually appear in the output, while
you consider it the number of groups started. I can't find a similar case
elsewhere in the code (e.g. Agg node does not report this kind of
information), so I have no clue. Someone else will have to decide.
But there is an IncrementalSort node on the remote side.
Let's see what happens. The idea of the "CROSS JOIN, not pushed down" test is that a cross join with ORDER BY LIMIT is not beneficial to push down, because LIMIT is not pushed down and the remote side wouldn't be able to use a top-N heapsort. But if the remote side has incremental sort then it can be used, and fetching the first 110 rows is cheap. Let's look at the plan of the original "CROSS JOIN, not pushed down" test with incremental sort.
# EXPLAIN (ANALYZE, VERBOSE) SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
ok, understood, thanks. Perhaps it's worth a comment in the test script.
I'm afraid I still see a problem. The diff removes a query that (although a
bit different from the one above) lets the CROSS JOIN be pushed down and
does introduce the IncrementalSort in the remote database. This query is
replaced with one that does not allow for the join push down.
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 510,517 ****
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 510,517 ----
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
-- CROSS JOIN, not pushed down
EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
Shouldn't the test contain *both* cases?
--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
On Wed, Nov 22, 2017 at 12:01 AM, Thomas Munro <
thomas.munro@enterprisedb.com> wrote:
I gather that you have
determined empirically that it's better to be able to sort groups of
at least MIN_GROUP_SIZE than to be able to skip the comparisons on the
leading attributes, but why is that the case?
Right. The issue is that not only the case of one tuple per group causes
overhead; a few tuples (like 2 or 3) per group is also a case of overhead. Also,
the overhead is related not only to sorting. While investigating the regression
case provided by Heikki [1], I saw extra time spent mostly in the extra
copying of the sample tuple and the comparison with it. In order to cope with this
overhead I've introduced MIN_GROUP_SIZE, which allows us to skip copying sample
tuples too frequently.
I see. I wonder if there could ever be a function like
ExecMoveTuple(dst, src). Given the polymorphism involved it'd be
slightly complicated and you'd probably have a general case that just
copies the tuple to dst and clears src, but there might be a bunch of
cases where you can do something more efficient like moving a pointer
and pin ownership. I haven't really thought that through and
there may be fundamental problems with it...
ExecMoveTuple(dst, src) would be good. But, it would be hard to implement
"moving a pointer and pin ownership" principle in our current
infrastructure. It's because source and destination can have different
memory contexts. AFAICS, we can't just move memory area between memory
contexts: we have to allocate new area, then memcpy, and then deallocate
old area.
If you're going to push the tuples into the sorter every time, then I
guess there are some special cases that could allow future
optimisations: (1) if you noticed that every prefix was different, you
can skip the sort operation (that is, you can use the sorter as a dumb
tuplestore and just get the tuples out in the same order you put them
in; not sure if Tuplesort supports that but it presumably could),
In order to notice that every prefix is different, I have to compare every
prefix. But that may introduce an overhead. So, the reason I
introduced MIN_GROUP_SIZE is exactly to avoid comparing every prefix...
(2)
if you noticed that every prefix was the same (that is, you have only
one prefix/group in the sorter) then you could sort only on the suffix
(that is, you could somehow tell Tuplesort to ignore the leading
columns),
Yes, I did so before. But again, after introducing MIN_GROUP_SIZE, I
lost the knowledge of whether all the prefixes were the same or different. This
is why I have to sort by the full column list for now...
(3) as a more complicated optimisation for intermediate
group sizes 1 < n < MIN_GROUP_SIZE, you could somehow number the
groups with an integer that increments whenever you see the prefix
change, and somehow tell tuplesort.c to use that instead of the
leading columns.
That is an interesting idea. The reason we have an overhead in comparison
with a plain sort is that we do an extra comparison (and copying), but the
knowledge of this comparison result is lost to the sort itself. With that approach,
the sort could "reuse" the prefix comparison, and the overhead would be lower.
But the problem is that we would have to reformat tuples before putting them
into the tuplesort. I wonder if tuple reformatting could eat the potential
performance win...
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi!
On Fri, Dec 1, 2017 at 11:39 AM, Antonin Houska <ah@cybertec.at> wrote:
I expected the number of groups that actually appear in the
output,
you consider it the number of groups started. I can't find similar case
elsewhere in the code (e.g. Agg node does not report this kind of
information), so I have no clue. Someone else will have to decide.
OK.
But there is IncrementalSort node on the remote side.
Let's see what happens. The idea of the "CROSS JOIN, not pushed down" test is
that a cross join with ORDER BY LIMIT is not beneficial to push down, because
LIMIT is not pushed down and the remote side wouldn't be able to use a top-N
heapsort. But if the remote side has incremental sort then it can be used, and fetching the first 110 rows is cheap. Let's look at the plan of the original
"CROSS JOIN, not pushed down" test with incremental sort.
# EXPLAIN (ANALYZE, VERBOSE) SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN
ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
ok, understood, thanks. Perhaps it's worth a comment in the test script.
I'm afraid I still see a problem. The diff removes a query that (although a
bit different from the one above) lets the CROSS JOIN to be pushed down and
does introduce the IncrementalSort in the remote database. This query is
replaced with one that does not allow for the join push down.
Shouldn't the test contain *both* cases?
Thank you for pointing that out. Sure, both cases are better. I've added
the second case as well as comments. The patch is attached.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
incremental-sort-12.patchapplication/octet-stream; name=incremental-sort-12.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 683d641fa7..1814f98b8e 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1979,27 +1979,18 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
119
(10 rows)
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, pushed down, thanks to incremental sort on remote side
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
- QUERY PLAN
----------------------------------------------------------------------
+ QUERY PLAN
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit
Output: t1.c1, t2.c1
- -> Sort
+ -> Foreign Scan
Output: t1.c1, t2.c1
- Sort Key: t1.c1, t2.c1
- -> Nested Loop
- Output: t1.c1, t2.c1
- -> Foreign Scan on public.ft1 t1
- Output: t1.c1
- Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
- -> Materialize
- Output: t2.c1
- -> Foreign Scan on public.ft2 t2
- Output: t2.c1
- Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-(15 rows)
+ Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+ Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
c1 | c1
@@ -2016,6 +2007,44 @@ SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 1
1 | 110
(10 rows)
+-- CROSS JOIN, not pushed down, because we don't push down LIMIT and remote side
+-- can't perform top-N sort like local side can.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+ QUERY PLAN
+------------------------------------------------------------------
+ Limit
+ Output: t1.c3, t2.c3
+ -> Sort
+ Output: t1.c3, t2.c3
+ Sort Key: t1.c3, t2.c3
+ -> Nested Loop
+ Output: t1.c3, t2.c3
+ -> Foreign Scan on public.ft1 t1
+ Output: t1.c3
+ Remote SQL: SELECT c3 FROM "S 1"."T 1"
+ -> Materialize
+ Output: t2.c3
+ -> Foreign Scan on public.ft2 t2
+ Output: t2.c3
+ Remote SQL: SELECT c3 FROM "S 1"."T 1"
+(15 rows)
+
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+ c3 | c3
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 3c3c5c705f..bbf697d64b 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -508,10 +508,15 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, pushed down, thanks to incremental sort on remote side
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, not pushed down, because we don't push down LIMIT and remote side
+-- can't perform top-N sort like local side can.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 533faf060d..3335fee127 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3553,6 +3553,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+ <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of incremental sort
+ steps. The default is <literal>on</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
<term><varname>enable_indexscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7e4fbafc53..0f993faba4 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
ExplainState *es);
static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
- int nkeys, AttrNumber *keycols,
+ int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
@@ -1011,6 +1015,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Sort:
pname = sname = "Sort";
break;
+ case T_IncrementalSort:
+ pname = sname = "Incremental Sort";
+ break;
case T_Group:
pname = sname = "Group";
break;
@@ -1611,6 +1618,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_sort_keys(castNode(SortState, planstate), ancestors, es);
show_sort_info(castNode(SortState, planstate), es);
break;
+ case T_IncrementalSort:
+ show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ ancestors, es);
+ show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ es);
+ break;
case T_MergeAppend:
show_merge_append_keys(castNode(MergeAppendState, planstate),
ancestors, es);
@@ -1936,14 +1949,37 @@ static void
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
+ int skipCols;
+
+ if (IsA(plan, IncrementalSort))
+ skipCols = ((IncrementalSort *) plan)->skipCols;
+ else
+ skipCols = 0;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
- plan->numCols, plan->sortColIdx,
+ plan->numCols, skipCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es)
+{
+ IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+ show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ plan->sort.numCols, plan->skipCols,
+ plan->sort.sortColIdx,
+ plan->sort.sortOperators, plan->sort.collations,
+ plan->sort.nullsFirst,
+ ancestors, es);
+}
+
/*
* Likewise, for a MergeAppend node.
*/
@@ -1954,7 +1990,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
- plan->numCols, plan->sortColIdx,
+ plan->numCols, 0, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
@@ -1978,7 +2014,7 @@ show_agg_keys(AggState *astate, List *ancestors,
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
- plan->numCols, plan->grpColIdx,
+ plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
@@ -2047,7 +2083,7 @@ show_grouping_set_keys(PlanState *planstate,
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
- sortnode->numCols, sortnode->sortColIdx,
+ sortnode->numCols, 0, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
@@ -2104,7 +2140,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
- plan->numCols, plan->grpColIdx,
+ plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
@@ -2117,13 +2153,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
- int nkeys, AttrNumber *keycols,
+ int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
+ List *resultPresorted = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
@@ -2163,9 +2200,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
nullsFirst[keyno]);
/* Emit one property-list item per sort key */
result = lappend(result, pstrdup(sortkeybuf.data));
+ if (keyno < nPresortedKeys)
+ resultPresorted = lappend(resultPresorted, exprstr);
}
ExplainPropertyList(qlabel, result, es);
+ if (nPresortedKeys > 0)
+ ExplainPropertyList("Presorted Key", resultPresorted, es);
}
/*
@@ -2373,6 +2414,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
}
}
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es)
+{
+ if (es->analyze && incrsortstate->sort_Done &&
+ incrsortstate->tuplesortstate != NULL)
+ {
+ Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ TuplesortInstrumentation stats;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+
+ tuplesort_get_stats(state, &stats);
+ sortMethod = tuplesort_method_name(stats.sortMethod);
+ spaceType = tuplesort_space_type_name(stats.spaceType);
+ spaceUsed = stats.spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",
+ sortMethod, spaceType, spaceUsed);
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Groups: %ld\n",
+ incrsortstate->groupsCount);
+ }
+ else
+ {
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups",
+ incrsortstate->groupsCount, es);
+ }
+ }
+
+ if (incrsortstate->shared_info != NULL)
+ {
+ int n;
+ bool opened_group = false;
+
+ for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+ {
+ TuplesortInstrumentation *sinstrument;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+ int64 groupsCount;
+
+ sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+ groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+ if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+ continue; /* ignore any unfilled slots */
+ sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+ spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+ spaceUsed = sinstrument->spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str,
+ "Worker %d: Sort Method: %s %s: %ldkB Groups: %ld\n",
+ n, sortMethod, spaceType, spaceUsed, groupsCount);
+ }
+ else
+ {
+ if (!opened_group)
+ {
+ ExplainOpenGroup("Workers", "Workers", false, es);
+ opened_group = true;
+ }
+ ExplainOpenGroup("Worker", NULL, true, es);
+ ExplainPropertyInteger("Worker Number", n, es);
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups", groupsCount, es);
+ ExplainCloseGroup("Worker", NULL, true, es);
+ }
+ }
+ if (opened_group)
+ ExplainCloseGroup("Workers", "Workers", false, es);
+ }
+}
+
/*
* Show information on hash buckets/batches.
*/
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
- nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
- nodeValuesscan.o \
+ nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+ nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index f1636a5b88..dd8cffea9c 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
ExecReScanSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecReScanIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecReScanGroup((GroupState *) node);
break;
@@ -525,8 +530,12 @@ ExecSupportsBackwardScan(Plan *node)
case T_CteScan:
case T_Material:
case T_Sort:
+ /* these don't evaluate tlist */
return true;
+ case T_IncrementalSort:
+ return false;
+
case T_LockRows:
case T_Limit:
return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 558cb08b07..9cb16ca1b6 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
#include "executor/nodeCustom.h"
#include "executor/nodeForeignscan.h"
#include "executor/nodeHash.h"
+#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeSeqscan.h"
@@ -274,6 +275,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecSortEstimate((SortState *) planstate, e->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+ break;
default:
break;
@@ -482,6 +487,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+ break;
default:
break;
@@ -917,6 +926,10 @@ ExecParallelReInitializeDSM(PlanState *planstate,
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecSortReInitializeDSM((SortState *) planstate, pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortReInitializeDSM((IncrementalSortState *) planstate, pcxt);
+ break;
default:
break;
@@ -987,6 +1000,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
case T_SortState:
ExecSortRetrieveInstrumentation((SortState *) planstate);
break;
+ case T_IncrementalSortState:
+ ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+ break;
case T_HashState:
ExecHashRetrieveInstrumentation((HashState *) planstate);
break;
@@ -1231,6 +1247,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecSortInitializeWorker((SortState *) planstate, pwcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+ pwcxt);
+ break;
default:
break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 9befca9016..7e7e3e666e 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
estate, eflags);
break;
+ case T_IncrementalSort:
+ result = (PlanState *) ExecInitIncrementalSort(
+ (IncrementalSort *) node, estate, eflags);
+ break;
+
case T_Group:
result = (PlanState *) ExecInitGroup((Group *) node,
estate, eflags);
@@ -679,6 +685,10 @@ ExecEndNode(PlanState *node)
ExecEndSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecEndIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecEndGroup((GroupState *) node);
break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index da6ef1a94c..ae9edb96ab 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -666,6 +666,7 @@ initialize_phase(AggState *aggstate, int newphase)
sortnode->collations,
sortnode->nullsFirst,
work_mem,
+ false,
false);
}
@@ -753,7 +754,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
- work_mem, false);
+ work_mem, false, false);
}
/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..1a1e48fb77
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,649 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ * Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ * Incremental sort is a specially optimized kind of multikey sort used
+ * when the input is already presorted by a prefix of the required keys
+ * list. Thus, when it's required to sort by (key1, key2 ... keyN) and the
+ * input is already sorted by (key1, key2 ... keyM), M < N, we separately
+ * sort groups in which the values of (key1, key2 ... keyM) are equal.
+ *
+ * Consider the following example. We have input tuples consisting of
+ * two integers (x, y), already presorted by x, while it's required to
+ * sort them by both x and y. Let the input tuples be the following.
+ *
+ * (1, 5)
+ * (1, 2)
+ * (2, 10)
+ * (2, 1)
+ * (2, 5)
+ * (3, 3)
+ * (3, 7)
+ *
+ * The incremental sort algorithm would individually sort by y the
+ * following groups, which have equal x:
+ * (1, 5) (1, 2)
+ * (2, 10) (2, 1) (2, 5)
+ * (3, 3) (3, 7)
+ *
+ * After sorting these groups and putting them together, we get the
+ * following tuple set, which is sorted by both x and y.
+ *
+ * (1, 2)
+ * (1, 5)
+ * (2, 1)
+ * (2, 5)
+ * (2, 10)
+ * (3, 3)
+ * (3, 7)
+ *
+ * Incremental sort is faster than a full sort on large datasets. But
+ * the biggest benefit shows up in queries with LIMIT, because
+ * incremental sort can return the first tuples without reading the
+ * whole input dataset.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for skipKeys comparison.
+ */
+static void
+prepareSkipCols(IncrementalSortState *node)
+{
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ int skipCols,
+ i;
+
+ Assert(IsA(plannode, IncrementalSort));
+ skipCols = plannode->skipCols;
+
+ node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+ for (i = 0; i < skipCols; i++)
+ {
+ Oid equalityOp, equalityFunc;
+ SkipKeyData *key;
+
+ key = &node->skipKeys[i];
+ key->attno = plannode->sort.sortColIdx[i];
+
+ equalityOp = get_equality_op_for_ordering_op(
+ plannode->sort.sortOperators[i], NULL);
+ if (!OidIsValid(equalityOp))
+ elog(ERROR, "missing equality operator for ordering operator %u",
+ plannode->sort.sortOperators[i]);
+
+ equalityFunc = get_opcode(equalityOp);
+ if (!OidIsValid(equalityFunc))
+ elog(ERROR, "missing function for operator %u", equalityOp);
+
+ /* Lookup the comparison function */
+ fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+ /* We can initialize the callinfo just once and re-use it */
+ InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ plannode->sort.collations[i], NULL, NULL);
+ key->fcinfo.argnull[0] = false;
+ key->fcinfo.argnull[1] = false;
+ }
+}
+
+/*
+ * Check whether the first "skipCols" sort key values of two tuples are equal.
+ */
+static bool
+cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ TupleTableSlot *b)
+{
+ int n, i;
+
+ Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+ n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+ for (i = 0; i < n; i++)
+ {
+ Datum datumA, datumB, result;
+ bool isnullA, isnullB;
+ AttrNumber attno = node->skipKeys[i].attno;
+ SkipKeyData *key;
+
+ datumA = slot_getattr(a, attno, &isnullA);
+ datumB = slot_getattr(b, attno, &isnullB);
+
+ /* Special case for NULL-vs-NULL, else use standard comparison */
+ if (isnullA || isnullB)
+ {
+ if (isnullA == isnullB)
+ continue;
+ else
+ return false;
+ }
+
+ key = &node->skipKeys[i];
+
+ key->fcinfo.arg[0] = datumA;
+ key->fcinfo.arg[1] = datumB;
+
+ /* just for paranoia's sake, we reset isnull each time */
+ key->fcinfo.isnull = false;
+
+ result = FunctionCallInvoke(&key->fcinfo);
+
+ /* Check for null result, since caller is clearly not expecting one */
+ if (key->fcinfo.isnull)
+ elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+ if (!DatumGetBool(result))
+ return false;
+ }
+ return true;
+}
+
+/*
+ * Copying tuples to node->sampleSlot introduces some overhead. It's
+ * especially noticeable when groups contain only one or a few tuples. To
+ * cope with this problem, we don't copy the sample tuple until the group
+ * contains at least MIN_GROUP_SIZE tuples. Surely, this might reduce the
+ * efficiency of incremental sort, but it reduces the probability of regression.
+ */
+#define MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSort
+ *
+ * Assuming that the outer subtree returns tuples presorted by some
+ * prefix of the target sort columns, performs an incremental sort. It
+ * fetches groups of tuples whose prefix sort columns are equal and
+ * sorts them using tuplesort. This approach avoids sorting the whole
+ * dataset. Besides taking less memory and being faster, it allows us
+ * to start returning tuples before fetching the full dataset from the
+ * outer subtree.
+ *
+ * Conditions:
+ * -- none.
+ *
+ * Initial States:
+ * -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+ IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+ EState *estate;
+ ScanDirection dir;
+ Tuplesortstate *tuplesortstate;
+ TupleTableSlot *slot;
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ PlanState *outerNode;
+ TupleDesc tupDesc;
+ int64 nTuples = 0;
+
+ /*
+ * get state info from node
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "entering routine");
+
+ estate = node->ss.ps.state;
+ dir = estate->es_direction;
+ tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+ /*
+ * Return next tuple from sorted set if any.
+ */
+ if (node->sort_Done)
+ {
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ if (tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL) || node->finished)
+ return slot;
+ }
+
+ /*
+ * Read the tuples of the next presorted group from the outer plan and
+ * pass them to tuplesort.c; the sorted tuples are then fetched from it.
+ */
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "sorting subplan");
+
+ /*
+ * Want to scan subplan in the forward direction while creating the
+ * sorted data.
+ */
+ estate->es_direction = ForwardScanDirection;
+
+ /*
+ * Initialize tuplesort module.
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "calling tuplesort_begin");
+
+ outerNode = outerPlanState(node);
+ tupDesc = ExecGetResultType(outerNode);
+
+ if (node->tuplesortstate == NULL)
+ {
+ /*
+ * We are going to process the first group of presorted data.
+ * Initialize support structures for cmpSortSkipCols - already
+ * sorted columns.
+ */
+ prepareSkipCols(node);
+
+ /*
+ * Pass all the columns to tuplesort. We pass groups of at least
+ * MIN_GROUP_SIZE tuples to the tuplesort, so these groups don't
+ * necessarily have equal values of the presorted columns. Groups are
+ * unlikely to be huge with incremental sort, so using abbreviated
+ * keys would likely be a waste of time.
+ */
+ tuplesortstate = tuplesort_begin_heap(
+ tupDesc,
+ plannode->sort.numCols,
+ plannode->sort.sortColIdx,
+ plannode->sort.sortOperators,
+ plannode->sort.collations,
+ plannode->sort.nullsFirst,
+ work_mem,
+ false,
+ true);
+ node->tuplesortstate = (void *) tuplesortstate;
+ }
+ else
+ {
+ /* Next group of presorted data */
+ tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ }
+ node->groupsCount++;
+
+ /* Calculate remaining bound for bounded sort */
+ if (node->bounded)
+ tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+ /* Put the saved tuple, if any, into the tuplesort */
+ if (!TupIsNull(node->sampleSlot))
+ {
+ tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+ ExecClearTuple(node->sampleSlot);
+ nTuples++;
+ }
+
+ /*
+ * Put the next group of tuples, whose skipCols sort values are all
+ * equal, into the tuplesort.
+ */
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+
+ /* Put next group of presorted data to the tuplesort */
+ if (nTuples < MIN_GROUP_SIZE)
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+
+ /* Save last tuple in minimal group */
+ if (nTuples == MIN_GROUP_SIZE - 1)
+ ExecCopySlot(node->sampleSlot, slot);
+ nTuples++;
+ }
+ else
+ {
+ /* Iterate while skip cols are the same as in saved tuple */
+ bool cmp;
+ cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+
+ if (cmp)
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ nTuples++;
+ }
+ else
+ {
+ ExecCopySlot(node->sampleSlot, slot);
+ break;
+ }
+ }
+ }
+
+ /*
+ * Complete the sort.
+ */
+ tuplesort_performsort(tuplesortstate);
+
+ /*
+ * restore to user specified direction
+ */
+ estate->es_direction = dir;
+
+ /*
+ * finally set the sorted flag to true
+ */
+ node->sort_Done = true;
+ node->bounded_Done = node->bounded;
+ if (node->shared_info && node->am_worker)
+ {
+ TuplesortInstrumentation *si;
+
+ Assert(IsParallelWorker());
+ Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+ si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+ tuplesort_get_stats(tuplesortstate, si);
+ node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+ node->groupsCount;
+ }
+
+ /*
+ * Adjust bound_Done with number of tuples we've actually sorted.
+ */
+ if (node->bounded)
+ {
+ if (node->finished)
+ node->bound_Done = node->bound;
+ else
+ node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ }
+
+ SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "retrieving tuple from tuplesort");
+
+ /*
+ * Get the first or next tuple from tuplesort. Returns NULL if no more
+ * tuples.
+ */
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+ return slot;
+}
+
+/* ----------------------------------------------------------------
+ * ExecInitIncrementalSort
+ *
+ * Creates the run-time state information for the sort node
+ * produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+ IncrementalSortState *incrsortstate;
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "initializing sort node");
+
+ /*
+ * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+ * current group of tuples in tuplesortstate.
+ */
+ Assert((eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) == 0);
+
+ /*
+ * create state structure
+ */
+ incrsortstate = makeNode(IncrementalSortState);
+ incrsortstate->ss.ps.plan = (Plan *) node;
+ incrsortstate->ss.ps.state = estate;
+ incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+ incrsortstate->bounded = false;
+ incrsortstate->sort_Done = false;
+ incrsortstate->finished = false;
+ incrsortstate->tuplesortstate = NULL;
+ incrsortstate->sampleSlot = NULL;
+ incrsortstate->bound_Done = 0;
+ incrsortstate->groupsCount = 0;
+ incrsortstate->skipKeys = NULL;
+
+ /*
+ * Miscellaneous initialization
+ *
+ * Sort nodes don't initialize their ExprContexts because they never call
+ * ExecQual or ExecProject.
+ */
+
+ /*
+ * tuple table initialization
+ *
+ * sort nodes only return scan tuples from their sorted relation.
+ */
+ ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+
+ /*
+ * initialize child nodes
+ *
+ * We shield the child node from the need to support REWIND, BACKWARD, or
+ * MARK/RESTORE.
+ */
+ eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+ outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+ /*
+ * initialize tuple type. no need to initialize projection info because
+ * this node doesn't do projections.
+ */
+ ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+ /* make standalone slot to store previous tuple from outer node */
+ incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+ ExecGetResultType(outerPlanState(incrsortstate)));
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "sort node initialized");
+
+ return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ * ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "shutting down sort node");
+
+ /*
+ * clean out the tuple table
+ */
+ ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ /* must drop standalone tuple slot from outer node */
+ ExecDropSingleTupleTableSlot(node->sampleSlot);
+
+ /*
+ * Release tuplesort resources
+ */
+ if (node->tuplesortstate != NULL)
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+
+ /*
+ * shut down the subplan
+ */
+ ExecEndNode(outerPlanState(node));
+
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+ PlanState *outerPlan = outerPlanState(node);
+
+ /*
+ * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ * re-scan it at all.
+ */
+ if (!node->sort_Done)
+ return;
+
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+ /*
+ * Incremental sort doesn't support rescanning the sorted output: we
+ * hold only the current group in the tuplesort. So we always forget
+ * the previous sort results, re-read the subplan, and re-sort.
+ */
+ node->sort_Done = false;
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+ node->bound_Done = 0;
+
+ /*
+ * if chgParam of subnode is not null then plan will be re-scanned by
+ * first ExecProcNode.
+ */
+ if (outerPlan->chgParam == NULL)
+ ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ * Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortEstimate
+ *
+ * Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+ size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+ shm_toc_estimate_chunk(&pcxt->estimator, size);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortInitializeDSM
+ *
+ * Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + pcxt->nworkers * sizeof(IncrementalSortInfo);
+ node->shared_info = shm_toc_allocate(pcxt->toc, size);
+ /* ensure any unfilled slots will contain zeroes */
+ memset(node->shared_info, 0, size);
+ node->shared_info->num_workers = pcxt->nworkers;
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+ node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+ /* If there's any instrumentation space, clear it for next time */
+ if (node->shared_info != NULL)
+ {
+ memset(node->shared_info->sinfo, 0,
+ node->shared_info->num_workers * sizeof(IncrementalSortInfo));
+ }
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ * Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+ node->shared_info =
+ shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+ node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ * Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+ Size size;
+ SharedIncrementalSortInfo *si;
+
+ if (node->shared_info == NULL)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+ si = palloc(size);
+ memcpy(si, node->shared_info, size);
+ node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73aa3715e6..ef3587c2f0 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
plannode->collations,
plannode->nullsFirst,
work_mem,
- node->randomAccess);
+ node->randomAccess,
+ false);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index b1515dd8e1..b468158a4c 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -919,6 +919,24 @@ _copyMaterial(const Material *from)
}
+/*
+ * CopySortFields
+ *
+ * This function copies the fields of the Sort node. It is used by
+ * all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+ CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+ COPY_SCALAR_FIELD(numCols);
+ COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
/*
* _copySort
*/
@@ -930,13 +948,29 @@ _copySort(const Sort *from)
/*
* copy node superclass fields
*/
- CopyPlanFields((const Plan *) from, (Plan *) newnode);
+ CopySortFields(from, newnode);
- COPY_SCALAR_FIELD(numCols);
- COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
- COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
- COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
- COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+ IncrementalSort *newnode = makeNode(IncrementalSort);
+
+ /*
+ * copy node superclass fields
+ */
+ CopySortFields((const Sort *) from, (Sort *) newnode);
+
+ /*
+ * copy remainder of node
+ */
+ COPY_SCALAR_FIELD(skipCols);
return newnode;
}
@@ -4816,6 +4850,9 @@ copyObjectImpl(const void *from)
case T_Sort:
retval = _copySort(from);
break;
+ case T_IncrementalSort:
+ retval = _copyIncrementalSort(from);
+ break;
case T_Group:
retval = _copyGroup(from);
break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b59a5219a7..29dbb7b665 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -870,12 +870,10 @@ _outMaterial(StringInfo str, const Material *node)
}
static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
{
int i;
- WRITE_NODE_TYPE("SORT");
-
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
@@ -897,6 +895,24 @@ _outSort(StringInfo str, const Sort *node)
appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
}
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+ WRITE_NODE_TYPE("SORT");
+
+ _outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+ WRITE_NODE_TYPE("INCREMENTALSORT");
+
+ _outSortInfo(str, (const Sort *) node);
+
+ WRITE_INT_FIELD(skipCols);
+}
+
static void
_outUnique(StringInfo str, const Unique *node)
{
@@ -3738,6 +3754,9 @@ outNode(StringInfo str, const void *obj)
case T_Sort:
_outSort(str, obj);
break;
+ case T_IncrementalSort:
+ _outIncrementalSort(str, obj);
+ break;
case T_Unique:
_outUnique(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 0d17ae89b0..baf9ba034c 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2060,12 +2060,13 @@ _readMaterial(void)
}
/*
- * _readSort
+ * ReadCommonSort
+ * Assign the basic stuff of all nodes that inherit from Sort
*/
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
{
- READ_LOCALS(Sort);
+ READ_TEMP_LOCALS();
ReadCommonPlan(&local_node->plan);
@@ -2074,6 +2075,32 @@ _readSort(void)
READ_OID_ARRAY(sortOperators, local_node->numCols);
READ_OID_ARRAY(collations, local_node->numCols);
READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+ READ_LOCALS_NO_FIELDS(Sort);
+
+ ReadCommonSort(local_node);
+
+ READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+ READ_LOCALS(IncrementalSort);
+
+ ReadCommonSort(&local_node->sort);
+
+ READ_INT_FIELD(skipCols);
READ_DONE();
}
@@ -2635,6 +2662,8 @@ parseNodeString(void)
return_value = _readMaterial();
else if (MATCH("SORT", 4))
return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+ return_value = _readIncrementalSort();
else if (MATCH("GROUP", 5))
return_value = _readGroup();
else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 47986ba80a..029617219e 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3613,6 +3613,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
ptype = "Sort";
subpath = ((SortPath *) path)->subpath;
break;
+ case T_IncrementalSortPath:
+ ptype = "IncrementalSort";
+ subpath = ((SortPath *) path)->subpath;
+ break;
case T_GroupPath:
ptype = "Group";
subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 877827dcb5..440bfbfd6e 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -121,6 +121,7 @@ bool enable_indexonlyscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
+bool enable_incrementalsort = true;
bool enable_hashagg = true;
bool enable_nestloop = true;
bool enable_material = true;
@@ -1604,6 +1605,13 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
+ * The sort may be either a full sort of the relation, or an incremental sort
+ * when the input is already presorted by a prefix of the required pathkeys.
+ * In the latter case we estimate the number of groups into which the input
+ * is divided by the presorted pathkeys, and then estimate the cost of
+ * sorting each individual group, assuming the data is distributed uniformly
+ * across the groups. Also, if a LIMIT is specified, we only have to pull
+ * from the input and sort a fraction of the total groups.
+ *
* If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
@@ -1630,7 +1638,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
+ * 'presorted_keys' is the number of pathkeys already presorted in the given path
+ * 'input_startup_cost' is the startup cost for reading the input data
+ * 'input_total_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
@@ -1646,19 +1656,28 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
*/
void
cost_sort(Path *path, PlannerInfo *root,
- List *pathkeys, Cost input_cost, double tuples, int width,
- Cost comparison_cost, int sort_mem,
+ List *pathkeys, int presorted_keys,
+ Cost input_startup_cost, Cost input_total_cost,
+ double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples)
{
- Cost startup_cost = input_cost;
- Cost run_cost = 0;
+ Cost startup_cost = input_startup_cost;
+ Cost run_cost = 0,
+ rest_cost,
+ group_cost,
+ input_run_cost = input_total_cost - input_startup_cost;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
+ double num_groups,
+ group_input_bytes,
+ group_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
+ if (!enable_incrementalsort)
+ presorted_keys = 0;
path->rows = tuples;
@@ -1684,13 +1703,50 @@ cost_sort(Path *path, PlannerInfo *root,
output_bytes = input_bytes;
}
- if (output_bytes > sort_mem_bytes)
+ /*
+	 * Estimate the number of groups into which the dataset is divided by the
+	 * presorted keys.
+ */
+ if (presorted_keys > 0)
+ {
+ List *presortedExprs = NIL;
+ ListCell *l;
+ int i = 0;
+
+ /* Extract presorted keys as list of expressions */
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ presortedExprs = lappend(presortedExprs, member->em_expr);
+
+ i++;
+ if (i >= presorted_keys)
+ break;
+ }
+
+ /* Estimate number of groups with equal presorted keys */
+ num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
+ }
+ else
+ {
+ num_groups = 1.0;
+ }
+
+ /*
+	 * Estimate the average cost of sorting one group, within which all
+	 * presorted keys are equal.
+ */
+ group_input_bytes = input_bytes / num_groups;
+ group_tuples = tuples / num_groups;
+ if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
- double npages = ceil(input_bytes / BLCKSZ);
- double nruns = input_bytes / sort_mem_bytes;
+ double npages = ceil(group_input_bytes / BLCKSZ);
+ double nruns = group_input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
@@ -1700,7 +1756,7 @@ cost_sort(Path *path, PlannerInfo *root,
*
* Assume about N log2 N comparisons
*/
- startup_cost += comparison_cost * tuples * LOG2(tuples);
+ group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
/* Disk costs */
@@ -1711,10 +1767,10 @@ cost_sort(Path *path, PlannerInfo *root,
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
- startup_cost += npageaccesses *
+ group_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
- else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
+ else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
@@ -1722,14 +1778,33 @@ cost_sort(Path *path, PlannerInfo *root,
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
- startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+ group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
}
else
{
- /* We'll use plain quicksort on all the input tuples */
- startup_cost += comparison_cost * tuples * LOG2(tuples);
+ /*
+		 * We'll use plain quicksort on all the input tuples. If we expect
+		 * fewer than two tuples per sort group, assume the logarithmic part
+		 * of the estimate to be 1.
+ */
+ if (group_tuples >= 2.0)
+ group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
+ else
+ group_cost = comparison_cost * group_tuples;
}
+	/* Add the per-group cost of fetching tuples from the input */
+ group_cost += input_run_cost / num_groups;
+
+ /*
+	 * We must sort the first group before the node can emit its first tuple;
+	 * sorting the remaining groups is required to return all the other
+	 * tuples.
+ */
+ startup_cost += group_cost;
+ rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ if (rest_cost > 0.0)
+ run_cost += rest_cost;
+
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
@@ -1740,6 +1815,19 @@ cost_sort(Path *path, PlannerInfo *root,
*/
run_cost += cpu_operator_cost * tuples;
+ /* Extra costs of incremental sort */
+ if (presorted_keys > 0)
+ {
+ /*
+		 * In the incremental sort case we must also pay to detect sort group
+		 * boundaries, which amounts to an extra copy and comparison per
+		 * tuple.
+ */
+ run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+ /* Cost of per group tuplesort reset */
+ run_cost += 2.0 * cpu_tuple_cost * num_groups;
+ }
+
path->startup_cost = startup_cost;
path->total_cost = startup_cost + run_cost;
}
@@ -2708,6 +2796,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
cost_sort(&sort_path,
root,
outersortkeys,
+ pathkeys_common(outer_path->pathkeys, outersortkeys),
+ outer_path->startup_cost,
outer_path->total_cost,
outer_path_rows,
outer_path->pathtarget->width,
@@ -2734,6 +2824,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
cost_sort(&sort_path,
root,
innersortkeys,
+ pathkeys_common(inner_path->pathkeys, innersortkeys),
+ inner_path->startup_cost,
inner_path->total_cost,
inner_path_rows,
inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index c6870d314e..b97f22a23c 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -22,10 +22,12 @@
#include "nodes/nodeFuncs.h"
#include "nodes/plannodes.h"
#include "optimizer/clauses.h"
+#include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/tlist.h"
#include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -308,6 +310,33 @@ compare_pathkeys(List *keys1, List *keys2)
return PATHKEYS_EQUAL;
}
+
+/*
+ * pathkeys_common
+ * Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+ int n;
+ ListCell *key1,
+ *key2;
+ n = 0;
+
+ forboth(key1, keys1, key2, keys2)
+ {
+ PathKey *pathkey1 = (PathKey *) lfirst(key1);
+ PathKey *pathkey2 = (PathKey *) lfirst(key2);
+
+ if (pathkey1 != pathkey2)
+ return n;
+ n++;
+ }
+
+ return n;
+}
+
+
/*
* pathkeys_contained_in
* Common special case of compare_pathkeys: we just want to know
@@ -1488,26 +1517,42 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Returns the number of pathkeys that match the requested ordering. The
+ * remainder can be satisfied by an incremental sort.
*/
-static int
-pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
+int
+pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
{
- if (root->query_pathkeys == NIL)
+ int n_common_pathkeys;
+
+ if (query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
- if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
+ n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
+
+ if (enable_incrementalsort)
{
- /* It's useful ... or at least the first N keys are */
- return list_length(root->query_pathkeys);
+ /*
+		 * Return the number of pathkeys in common, or 0 if there are none.
+		 * Any common prefix of pathkeys is useful for ordering, because we
+		 * can complete the sort with an incremental sort.
+ */
+ return n_common_pathkeys;
+ }
+ else
+ {
+ /*
+		 * When incremental sort is disabled, the path's pathkeys are useful
+		 * only when they contain all the query pathkeys.
+ */
+ if (n_common_pathkeys == list_length(query_pathkeys))
+ return n_common_pathkeys;
+ else
+ return 0;
}
-
- return 0; /* path ordering not useful */
}
/*
@@ -1523,7 +1568,7 @@ truncate_useless_pathkeys(PlannerInfo *root,
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
- nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
+ nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index f6c83d0477..7833c4512b 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -236,7 +236,7 @@ static MergeJoin *make_mergejoin(List *tlist,
Plan *lefttree, Plan *righttree,
JoinType jointype, bool inner_unique,
bool skip_mark_restore);
-static Sort *make_sort(Plan *lefttree, int numCols,
+static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
@@ -252,10 +252,11 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
- Relids relids);
+ Relids relids, int skipCols);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
- Plan *lefttree);
+ Plan *lefttree,
+ int skipCols);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
@@ -437,6 +438,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
(GatherPath *) best_path);
break;
case T_Sort:
+ case T_IncrementalSort:
plan = (Plan *) create_sort_plan(root,
(SortPath *) best_path,
flags);
@@ -1122,6 +1124,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int n_common_pathkeys;
/* Build the child plan */
/* Must insist that all children return the same tlist */
@@ -1156,9 +1159,11 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
- if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+ if (n_common_pathkeys < list_length(pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
+ n_common_pathkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
@@ -1508,6 +1513,7 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
Plan *subplan;
List *pathkeys = best_path->path.pathkeys;
List *tlist = build_path_tlist(root, &best_path->path);
+ int n_common_pathkeys;
/* As with Gather, it's best to project away columns in the workers. */
subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
@@ -1537,12 +1543,16 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
- if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
+ n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
+ if (n_common_pathkeys < list_length(pathkeys))
+ {
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ n_common_pathkeys,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
+ }
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
@@ -1655,6 +1665,7 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
{
Sort *plan;
Plan *subplan;
+ int n_common_pathkeys;
/*
* We don't want any excess columns in the sorted tuples, so request a
@@ -1664,7 +1675,13 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
- plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+ if (IsA(best_path, IncrementalSortPath))
+ n_common_pathkeys = ((IncrementalSortPath *) best_path)->skipCols;
+ else
+ n_common_pathkeys = 0;
+
+ plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+ NULL, n_common_pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -1908,7 +1925,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
- subplan);
+ subplan,
+ 0);
}
if (!rollup->is_hashed)
@@ -3848,10 +3866,15 @@ create_mergejoin_plan(PlannerInfo *root,
*/
if (best_path->outersortkeys)
{
+ Sort *sort;
+ int n_common_pathkeys;
Relids outer_relids = outer_path->parent->relids;
- Sort *sort = make_sort_from_pathkeys(outer_plan,
- best_path->outersortkeys,
- outer_relids);
+
+ n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
+ best_path->jpath.outerjoinpath->pathkeys);
+
+ sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
+ outer_relids, n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
@@ -3862,10 +3885,15 @@ create_mergejoin_plan(PlannerInfo *root,
if (best_path->innersortkeys)
{
+ Sort *sort;
+ int n_common_pathkeys;
Relids inner_relids = inner_path->parent->relids;
- Sort *sort = make_sort_from_pathkeys(inner_plan,
- best_path->innersortkeys,
- inner_relids);
+
+ n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
+ best_path->jpath.innerjoinpath->pathkeys);
+
+ sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
+ inner_relids, n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
@@ -4916,8 +4944,13 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
{
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
+ int skip_cols = 0;
+
+ if (IsA(plan, IncrementalSort))
+ skip_cols = ((IncrementalSort *) plan)->skipCols;
- cost_sort(&sort_path, root, NIL,
+ cost_sort(&sort_path, root, NIL, skip_cols,
+ lefttree->startup_cost,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
@@ -5508,13 +5541,31 @@ make_mergejoin(List *tlist,
* nullsFirst arrays already.
*/
static Sort *
-make_sort(Plan *lefttree, int numCols,
+make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
- Sort *node = makeNode(Sort);
- Plan *plan = &node->plan;
+ Sort *node;
+ Plan *plan;
+
+ /* Always use regular sort node when enable_incrementalsort = false */
+ if (!enable_incrementalsort)
+ skipCols = 0;
+
+ if (skipCols == 0)
+ {
+ node = makeNode(Sort);
+ }
+ else
+ {
+ IncrementalSort *incrementalSort;
+
+ incrementalSort = makeNode(IncrementalSort);
+ node = &incrementalSort->sort;
+ incrementalSort->skipCols = skipCols;
+ }
+ plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
@@ -5847,9 +5898,11 @@ find_ec_member_for_tle(EquivalenceClass *ec,
* 'lefttree' is the node which yields input tuples
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
* 'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ * 'skipCols' is the number of presorted columns in input tuples
*/
static Sort *
-make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
+make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
+ Relids relids, int skipCols)
{
int numsortkeys;
AttrNumber *sortColIdx;
@@ -5869,7 +5922,7 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
&nullsFirst);
/* Now build the Sort node */
- return make_sort(lefttree, numsortkeys,
+ return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
@@ -5912,7 +5965,7 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
numsortkeys++;
}
- return make_sort(lefttree, numsortkeys,
+ return make_sort(lefttree, numsortkeys, 0,
sortColIdx, sortOperators,
collations, nullsFirst);
}
@@ -5933,7 +5986,8 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
- Plan *lefttree)
+ Plan *lefttree,
+ int skipCols)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
@@ -5966,7 +6020,7 @@ make_sort_from_groupcols(List *groupcls,
numsortkeys++;
}
- return make_sort(lefttree, numsortkeys,
+ return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
@@ -6623,6 +6677,7 @@ is_projection_capable_plan(Plan *plan)
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 889e8af33b..49af1f1912 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -44,6 +44,7 @@
#include "parser/parse_clause.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
#include "utils/syscache.h"
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e8bc15c35d..726ddd3025 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3849,14 +3849,14 @@ create_grouping_paths(PlannerInfo *root,
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
- bool is_sorted;
+ int n_useful_pathkeys;
- is_sorted = pathkeys_contained_in(root->group_pathkeys,
- path->pathkeys);
- if (path == cheapest_partial_path || is_sorted)
+ n_useful_pathkeys = pathkeys_useful_for_ordering(
+ root->group_pathkeys, path->pathkeys);
+ if (path == cheapest_partial_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest partial path, if it isn't already */
- if (!is_sorted)
+ if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
@@ -3929,14 +3929,14 @@ create_grouping_paths(PlannerInfo *root,
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
- bool is_sorted;
+ int n_useful_pathkeys;
- is_sorted = pathkeys_contained_in(root->group_pathkeys,
- path->pathkeys);
- if (path == cheapest_path || is_sorted)
+ n_useful_pathkeys = pathkeys_useful_for_ordering(
+ root->group_pathkeys, path->pathkeys);
+ if (path == cheapest_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest-total path if it isn't already sorted */
- if (!is_sorted)
+ if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
@@ -5003,13 +5003,13 @@ create_ordered_paths(PlannerInfo *root,
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
- bool is_sorted;
+ int n_useful_pathkeys;
- is_sorted = pathkeys_contained_in(root->sort_pathkeys,
- path->pathkeys);
- if (path == cheapest_input_path || is_sorted)
+ n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
+ path->pathkeys);
+ if (path == cheapest_input_path || n_useful_pathkeys > 0)
{
- if (!is_sorted)
+ if (n_useful_pathkeys < list_length(root->sort_pathkeys))
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
@@ -6139,8 +6139,9 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
- cost_sort(&seqScanAndSortPath, root, NIL,
- seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
+ cost_sort(&seqScanAndSortPath, root, NIL, 0,
+ seqScanPath->startup_cost, seqScanPath->total_cost,
+ rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index b5c41241d7..1ff9d42ab4 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 2e3abeea3d..0ee6812e80 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2782,6 +2782,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index a24e8acfa6..f79523d697 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -989,7 +989,8 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
- cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
+ cost_sort(&sorted_p, root, NIL, 0,
+ sorted_p.startup_cost, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 54126fbb6a..3b65ccca87 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -105,7 +105,7 @@ compare_path_costs(Path *path1, Path *path2, CostSelector criterion)
}
/*
- * compare_path_fractional_costs
+ * compare_fractional_path_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
@@ -1356,12 +1356,13 @@ create_merge_append_path(PlannerInfo *root,
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ int n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
- if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
@@ -1375,6 +1376,8 @@ create_merge_append_path(PlannerInfo *root,
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->parent->tuples,
subpath->pathtarget->width,
@@ -1622,7 +1625,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
/*
* Estimate cost for sort+unique implementation
*/
- cost_sort(&sort_path, root, NIL,
+ cost_sort(&sort_path, root, NIL, 0,
+ subpath->startup_cost,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
@@ -1715,6 +1719,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
GatherMergePath *pathnode = makeNode(GatherMergePath);
Cost input_startup_cost = 0;
Cost input_total_cost = 0;
+ int n_common_pathkeys;
Assert(subpath->parallel_safe);
Assert(pathkeys);
@@ -1731,7 +1736,9 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
- if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+
+ if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
@@ -1745,6 +1752,8 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
@@ -2601,9 +2610,31 @@ create_sort_path(PlannerInfo *root,
List *pathkeys,
double limit_tuples)
{
- SortPath *pathnode = makeNode(SortPath);
+ SortPath *pathnode;
+ int n_common_pathkeys;
+
+ if (enable_incrementalsort)
+ n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+ else
+ n_common_pathkeys = 0;
+
+ if (n_common_pathkeys == 0)
+ {
+ pathnode = makeNode(SortPath);
+ pathnode->path.pathtype = T_Sort;
+ }
+ else
+ {
+ IncrementalSortPath *incpathnode;
+
+ incpathnode = makeNode(IncrementalSortPath);
+ pathnode = &incpathnode->spath;
+ pathnode->path.pathtype = T_IncrementalSort;
+ incpathnode->skipCols = n_common_pathkeys;
+ }
+
+ Assert(n_common_pathkeys < list_length(pathkeys));
- pathnode->path.pathtype = T_Sort;
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
@@ -2617,7 +2648,9 @@ create_sort_path(PlannerInfo *root,
pathnode->subpath = subpath;
- cost_sort(&pathnode->path, root, pathkeys,
+ cost_sort(&pathnode->path, root,
+ pathkeys, n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
@@ -2929,7 +2962,8 @@ create_groupingsets_path(PlannerInfo *root,
else
{
/* Account for cost of sort, but don't charge input cost again */
- cost_sort(&sort_path, root, NIL,
+ cost_sort(&sort_path, root, NIL, 0,
+ 0.0,
0.0,
subpath->rows,
subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 1e323d9444..8f01f05ae5 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -291,7 +291,8 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
qstate->sortCollations,
qstate->sortNullsFirsts,
work_mem,
- qstate->rescan_needed);
+ qstate->rescan_needed,
+ false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index ea95b8068d..abf6c3853a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3714,6 +3714,42 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
return numdistinct;
}
+/*
+ * estimate_pathkeys_groups - Estimate the number of groups the dataset is
+ * divided into by each prefix of the pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * the first i+1 pathkeys divide the dataset into.  This is a convenience
+ * wrapper over estimate_num_groups().
+ */
+double *
+estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+{
+ ListCell *l;
+ List *groupExprs = NIL;
+ double *result;
+ int i;
+
+ /*
+ * Get number of groups for each prefix of pathkeys.
+ */
+ i = 0;
+ result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ groupExprs = lappend(groupExprs, member->em_expr);
+
+ result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ i++;
+ }
+
+ return result;
+}
+
/*
* Estimate hash bucket statistics when the specified expression is used
* as a hash key for the given number of buckets.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0f7a96d85c..9e4ec22366 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -857,6 +857,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of incremental sort steps."),
+ NULL
+ },
+ &enable_incrementalsort,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 3c23ac75a0..118edb98a4 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -231,6 +231,13 @@ struct Tuplesortstate
int64 allowedMem; /* total memory allowed, in bytes */
int maxTapes; /* number of tapes (Knuth's T) */
int tapeRange; /* maxTapes-1 (Knuth's P) */
+ int64 maxSpace; /* maximum amount of space occupied among
sorts of groups, either in-memory or on-disk */
+ bool maxSpaceOnDisk; /* true when maxSpace is the value for on-disk
space, false when it's the value for in-memory
space */
+ TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ MemoryContext maincontext;
MemoryContext sortcontext; /* memory context holding most sort data */
MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
@@ -573,6 +580,9 @@ static void writetup_datum(Tuplesortstate *state, int tapenum,
static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state, bool delete);
+static void tuplesort_updatemax(Tuplesortstate *state);
+
/*
* Special versions of qsort just for SortTuple objects. qsort_tuple() sorts
@@ -607,18 +617,29 @@ static Tuplesortstate *
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
+ MemoryContext maincontext;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
- * Create a working memory context for this sort operation. All data
- * needed by the sort will live inside this context.
+ * Memory context surviving tuplesort_reset. This memory context holds
+ * data which is useful to keep while sorting multiple similar batches.
*/
- sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+ maincontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
+ /*
+ * Create a working memory context for one sort operation. The content of
+ * this context is deleted by tuplesort_reset.
+ */
+ sortcontext = AllocSetContextCreate(maincontext,
+ "TupleSort sort",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+
/*
* Caller tuple (e.g. IndexTuple) memory context.
*
@@ -636,7 +657,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
- oldcontext = MemoryContextSwitchTo(sortcontext);
+ oldcontext = MemoryContextSwitchTo(maincontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
@@ -654,6 +675,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
state->availMem = state->allowedMem;
state->sortcontext = sortcontext;
state->tuplecontext = tuplecontext;
+ state->maincontext = maincontext;
state->tapeset = NULL;
state->memtupcount = 0;
@@ -694,13 +716,14 @@ tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
- int workMem, bool randomAccess)
+ int workMem, bool randomAccess,
+ bool skipAbbrev)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
AssertArg(nkeys > 0);
@@ -742,7 +765,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
- sortKey->abbreviate = (i == 0);
+ sortKey->abbreviate = (i == 0) && !skipAbbrev;
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
@@ -773,7 +796,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -864,7 +887,7 @@ tuplesort_begin_index_btree(Relation heapRel,
MemoryContext oldcontext;
int i;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -939,7 +962,7 @@ tuplesort_begin_index_hash(Relation heapRel,
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -981,7 +1004,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
int16 typlen;
bool typbyval;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -1092,16 +1115,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
}
/*
- * tuplesort_end
+ * tuplesort_free
*
- * Release resources and clean up.
- *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage. Be careful not to attempt to use or free such
- * pointers afterwards!
+ * Internal routine for freeing resources of tuplesort.
*/
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state, bool delete)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1160,7 +1179,98 @@ tuplesort_end(Tuplesortstate *state)
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
- MemoryContextDelete(state->sortcontext);
+ if (delete)
+ {
+ MemoryContextDelete(state->maincontext);
+ }
+ else
+ {
+ MemoryContextResetOnly(state->sortcontext);
+ MemoryContextResetOnly(state->tuplecontext);
+ }
+}
+
+/*
+ * tuplesort_end
+ *
+ * Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage. Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+ tuplesort_free(state, true);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ * Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+ int64 spaceUsed;
+ bool spaceUsedOnDisk;
+
+ /*
+ * Note: it might seem we should provide both memory and disk usage for a
+ * disk-based sort. However, the current code doesn't track memory space
+ * accurately once we have begun to return tuples to the caller (since we
+ * don't account for pfree's the caller is expected to do), so we cannot
+ * rely on availMem in a disk sort. This does not seem worth the overhead
+ * to fix. Is it worth creating an API for the memory context code to
+ * tell us how much is actually used in sortcontext?
+ */
+ if (state->tapeset)
+ {
+ spaceUsedOnDisk = true;
+ spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+ }
+ else
+ {
+ spaceUsedOnDisk = false;
+ spaceUsed = state->allowedMem - state->availMem;
+ }
+
+ if (spaceUsed > state->maxSpace)
+ {
+ state->maxSpace = spaceUsed;
+ state->maxSpaceOnDisk = spaceUsedOnDisk;
+ state->maxSpaceStatus = state->status;
+ }
+}
+
+/*
+ * tuplesort_reset
+ *
+ * Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ * meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ * start a new sort.  This avoids recreating the tuplesort (and saves
+ * resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+ tuplesort_updatemax(state);
+ tuplesort_free(state, false);
+ state->status = TSS_INITIAL;
+ state->memtupcount = 0;
+ state->boundUsed = false;
+ state->tapeset = NULL;
+ state->currentRun = 0;
+ state->result_tape = -1;
+ state->bounded = false;
+ state->availMem = state->allowedMem;
+ state->lastReturnedTuple = NULL;
+ state->slabAllocatorUsed = false;
+ state->slabMemoryBegin = NULL;
+ state->slabMemoryEnd = NULL;
+ state->slabFreeHead = NULL;
+ USEMEM(state, GetMemoryChunkSpace(state->memtuples));
}
/*
@@ -2949,18 +3059,15 @@ tuplesort_get_stats(Tuplesortstate *state,
* to fix. Is it worth creating an API for the memory context code to
* tell us how much is actually used in sortcontext?
*/
- if (state->tapeset)
- {
+ tuplesort_updatemax(state);
+
+ if (state->maxSpaceOnDisk)
stats->spaceType = SORT_SPACE_TYPE_DISK;
- stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- }
else
- {
stats->spaceType = SORT_SPACE_TYPE_MEMORY;
- stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
- }
+ stats->spaceUsed = (state->maxSpace + 1023) / 1024;
- switch (state->status)
+ switch (state->maxSpaceStatus)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..b2e4e5061f
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1a35c5c9ad..fba6082f95 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1753,6 +1753,20 @@ typedef struct MaterialState
Tuplestorestate *tuplestorestate;
} MaterialState;
+
+/* ----------------
+ * When sorting by multiple keys, the input dataset may already be presorted
+ * on some prefix of these keys.  We call these "skip keys".
+ * SkipKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct SkipKeyData
+{
+ FmgrInfo flinfo; /* comparison function info */
+ FunctionCallInfoData fcinfo; /* comparison function call info */
+ OffsetNumber attno; /* attribute number in tuple */
+} SkipKeyData;
+
/* ----------------
* Shared memory container for per-worker sort information
* ----------------
@@ -1781,6 +1795,44 @@ typedef struct SortState
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
+/* ----------------
+ * Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+ TuplesortInstrumentation sinstrument;
+ int64 groupsCount;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+ int num_workers;
+ IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ * IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+ ScanState ss; /* its first field is NodeTag */
+ bool bounded; /* is the result set bounded? */
+ int64 bound; /* if bounded, how many tuples are needed */
+ bool sort_Done; /* sort completed yet? */
+ bool finished; /* has fetching tuples from the outer node
finished? */
+ bool bounded_Done; /* value of bounded we did the sort with */
+ int64 bound_Done; /* value of bound we did the sort with */
+ void *tuplesortstate; /* private state of tuplesort.c */
+ SkipKeyData *skipKeys; /* keys the dataset is presorted by */
+ int64 groupsCount; /* number of groups with equal skip keys */
+ TupleTableSlot *sampleSlot; /* slot for sample tuple of sort group */
+ bool am_worker; /* are we a worker? */
+ SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
/* ---------------------
* GroupState information
* ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index c5b5115f5b..9ae5d57449 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -73,6 +73,7 @@ typedef enum NodeTag
T_HashJoin,
T_Material,
T_Sort,
+ T_IncrementalSort,
T_Group,
T_Agg,
T_WindowAgg,
@@ -125,6 +126,7 @@ typedef enum NodeTag
T_HashJoinState,
T_MaterialState,
T_SortState,
+ T_IncrementalSortState,
T_GroupState,
T_AggState,
T_WindowAggState,
@@ -240,6 +242,7 @@ typedef enum NodeTag
T_ProjectionPath,
T_ProjectSetPath,
T_SortPath,
+ T_IncrementalSortPath,
T_GroupPath,
T_UpperUniquePath,
T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 02fb366680..d6d15396a2 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -750,6 +750,17 @@ typedef struct Sort
bool *nullsFirst; /* NULLS FIRST/LAST directions */
} Sort;
+
+/* ----------------
+ * incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+ Sort sort;
+ int skipCols; /* number of presorted columns */
+} IncrementalSort;
+
/* ---------------
* group node -
* Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 1108b6a0ea..2baccda6ff 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1512,6 +1512,16 @@ typedef struct SortPath
Path *subpath; /* path representing input source */
} SortPath;
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+ SortPath spath;
+ int skipCols;
+} IncrementalSortPath;
+
+
/*
* GroupPath represents grouping (of presorted input)
*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 5a1fbf97c3..5e4acebe41 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern bool enable_indexonlyscan;
extern bool enable_bitmapscan;
extern bool enable_tidscan;
extern bool enable_sort;
+extern bool enable_incrementalsort;
extern bool enable_hashagg;
extern bool enable_nestloop;
extern bool enable_material;
@@ -104,8 +105,9 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
- List *pathkeys, Cost input_cost, double tuples, int width,
- Cost comparison_cost, int sort_mem,
+ List *pathkeys, int presorted_keys,
+ Cost input_startup_cost, Cost input_total_cost,
+ double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_append(AppendPath *path);
extern void cost_merge_append(Path *path, PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index ea886b6501..b4370e2621 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,7 @@ typedef enum
extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern int pathkeys_common(List *keys1, List *keys2);
extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
Relids required_outer,
CostSelector cost_criterion,
@@ -226,6 +227,7 @@ extern List *select_outer_pathkeys_for_merge(PlannerInfo *root,
extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
List *mergeclauses,
List *outer_pathkeys);
+extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
extern List *truncate_useless_pathkeys(PlannerInfo *root,
RelOptInfo *rel,
List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 199a6317f5..41b7196adf 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -206,6 +206,9 @@ extern void mergejoinscansel(PlannerInfo *root, Node *clause,
extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows, List **pgset);
+extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ double tuples);
+
extern void estimate_hash_bucket_stats(PlannerInfo *root,
Node *hashkey, double nbuckets,
Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index b6b8c8ef8c..938d329e15 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -90,7 +90,8 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
- int workMem, bool randomAccess);
+ int workMem, bool randomAccess,
+ bool skipAbbrev);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
@@ -134,6 +135,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
extern void tuplesort_end(Tuplesortstate *state);
+extern void tuplesort_reset(Tuplesortstate *state);
+
extern void tuplesort_get_stats(Tuplesortstate *state,
TuplesortInstrumentation *stats);
extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
-Sort
+Incremental Sort
Sort Key: id, data
- -> Seq Scan on test_dc
+ Presorted Key: id
+ -> Index Scan using test_dc_pkey on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index a79f891da7..0926650a0f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1517,6 +1517,7 @@ NOTICE: drop cascades to table matest1
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
SELECT thousand, tenthous FROM tenk1
@@ -1657,9 +1658,45 @@ FROM generate_series(1, 3) g(i);
{3,7,8,10,13,13,16,18,19,22}
(3 rows)
+set enable_incrementalsort = on;
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Merge Append
+ Sort Key: tenk1.thousand, tenk1.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1
+ -> Incremental Sort
+ Sort Key: tenk1_1.thousand, tenk1_1.thousand
+ Presorted Key: tenk1_1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+(7 rows)
+
+explain (costs off)
+SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+ QUERY PLAN
+-------------------------------------------------------------
+ Merge Append
+ Sort Key: a.thousand, a.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 a
+ -> Incremental Sort
+ Sort Key: b.unique2, b.unique2
+ Presorted Key: b.unique2
+ -> Index Only Scan using tenk1_unique2 on tenk1 b
+(7 rows)
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 2b738aae7c..896fdfb585 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_gathermerge | on
enable_hashagg | on
enable_hashjoin | on
+ enable_incrementalsort | on
enable_indexonlyscan | on
enable_indexscan | on
enable_material | on
@@ -86,7 +87,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(14 rows)
+(15 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 2e42ae115d..7229997144 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -546,6 +546,7 @@ drop table matest0 cascade;
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
@@ -607,9 +608,26 @@ SELECT
ORDER BY f.i LIMIT 10)
FROM generate_series(1, 3) g(i);
+set enable_incrementalsort = on;
+
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+
+explain (costs off)
+SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
On Fri, Dec 8, 2017 at 4:06 PM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:
Thank you for pointing that out. Sure, both cases are better. I've added
the second case as well as comments. Patch is attached.
I just found that the patch fails to apply according to commitfest.cputube.org.
Please, find rebased patch attached.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
incremental-sort-13.patchapplication/octet-stream; name=incremental-sort-13.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 683d641fa7..1814f98b8e 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1979,27 +1979,18 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
119
(10 rows)
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, pushed down, thanks to incremental sort on remote side
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
- QUERY PLAN
----------------------------------------------------------------------
+ QUERY PLAN
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit
Output: t1.c1, t2.c1
- -> Sort
+ -> Foreign Scan
Output: t1.c1, t2.c1
- Sort Key: t1.c1, t2.c1
- -> Nested Loop
- Output: t1.c1, t2.c1
- -> Foreign Scan on public.ft1 t1
- Output: t1.c1
- Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
- -> Materialize
- Output: t2.c1
- -> Foreign Scan on public.ft2 t2
- Output: t2.c1
- Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-(15 rows)
+ Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+ Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
c1 | c1
@@ -2016,6 +2007,44 @@ SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 1
1 | 110
(10 rows)
+-- CROSS JOIN, not pushed down, because we don't push down LIMIT and remote side
+-- can't perform top-N sort like local side can.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+ QUERY PLAN
+------------------------------------------------------------------
+ Limit
+ Output: t1.c3, t2.c3
+ -> Sort
+ Output: t1.c3, t2.c3
+ Sort Key: t1.c3, t2.c3
+ -> Nested Loop
+ Output: t1.c3, t2.c3
+ -> Foreign Scan on public.ft1 t1
+ Output: t1.c3
+ Remote SQL: SELECT c3 FROM "S 1"."T 1"
+ -> Materialize
+ Output: t2.c3
+ -> Foreign Scan on public.ft2 t2
+ Output: t2.c3
+ Remote SQL: SELECT c3 FROM "S 1"."T 1"
+(15 rows)
+
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+ c3 | c3
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 3c3c5c705f..bbf697d64b 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -508,10 +508,15 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, pushed down, thanks to incremental sort on remote side
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, not pushed down, because we don't push down LIMIT and remote side
+-- can't perform top-N sort like local side can.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e4a01699e4..f80d396cfc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3553,6 +3553,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+ <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of incremental sort
+ steps. The default is <literal>on</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
<term><varname>enable_indexscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 79e6985d0d..6cf5f8bad1 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
ExplainState *es);
static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
- int nkeys, AttrNumber *keycols,
+ int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
@@ -1011,6 +1015,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Sort:
pname = sname = "Sort";
break;
+ case T_IncrementalSort:
+ pname = sname = "Incremental Sort";
+ break;
case T_Group:
pname = sname = "Group";
break;
@@ -1611,6 +1618,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_sort_keys(castNode(SortState, planstate), ancestors, es);
show_sort_info(castNode(SortState, planstate), es);
break;
+ case T_IncrementalSort:
+ show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ ancestors, es);
+ show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ es);
+ break;
case T_MergeAppend:
show_merge_append_keys(castNode(MergeAppendState, planstate),
ancestors, es);
@@ -1936,14 +1949,37 @@ static void
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
+ int skipCols;
+
+ if (IsA(plan, IncrementalSort))
+ skipCols = ((IncrementalSort *) plan)->skipCols;
+ else
+ skipCols = 0;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
- plan->numCols, plan->sortColIdx,
+ plan->numCols, skipCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es)
+{
+ IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+ show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ plan->sort.numCols, plan->skipCols,
+ plan->sort.sortColIdx,
+ plan->sort.sortOperators, plan->sort.collations,
+ plan->sort.nullsFirst,
+ ancestors, es);
+}
+
/*
* Likewise, for a MergeAppend node.
*/
@@ -1954,7 +1990,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
- plan->numCols, plan->sortColIdx,
+ plan->numCols, 0, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
@@ -1978,7 +2014,7 @@ show_agg_keys(AggState *astate, List *ancestors,
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
- plan->numCols, plan->grpColIdx,
+ plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
@@ -2047,7 +2083,7 @@ show_grouping_set_keys(PlanState *planstate,
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
- sortnode->numCols, sortnode->sortColIdx,
+ sortnode->numCols, 0, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
@@ -2104,7 +2140,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
- plan->numCols, plan->grpColIdx,
+ plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
@@ -2117,13 +2153,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
- int nkeys, AttrNumber *keycols,
+ int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
+ List *resultPresorted = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
@@ -2163,9 +2200,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
nullsFirst[keyno]);
/* Emit one property-list item per sort key */
result = lappend(result, pstrdup(sortkeybuf.data));
+ if (keyno < nPresortedKeys)
+ resultPresorted = lappend(resultPresorted, exprstr);
}
ExplainPropertyList(qlabel, result, es);
+ if (nPresortedKeys > 0)
+ ExplainPropertyList("Presorted Key", resultPresorted, es);
}
/*
@@ -2373,6 +2414,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
}
}
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es)
+{
+ if (es->analyze && incrsortstate->sort_Done &&
+ incrsortstate->tuplesortstate != NULL)
+ {
+ Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ TuplesortInstrumentation stats;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+
+ tuplesort_get_stats(state, &stats);
+ sortMethod = tuplesort_method_name(stats.sortMethod);
+ spaceType = tuplesort_space_type_name(stats.spaceType);
+ spaceUsed = stats.spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",
+ sortMethod, spaceType, spaceUsed);
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Groups: %ld\n",
+ incrsortstate->groupsCount);
+ }
+ else
+ {
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups",
+ incrsortstate->groupsCount, es);
+ }
+ }
+
+ if (incrsortstate->shared_info != NULL)
+ {
+ int n;
+ bool opened_group = false;
+
+ for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+ {
+ TuplesortInstrumentation *sinstrument;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+ int64 groupsCount;
+
+ sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+ groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+ if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+ continue; /* ignore any unfilled slots */
+ sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+ spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+ spaceUsed = sinstrument->spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str,
+ "Worker %d: Sort Method: %s %s: %ldkB Groups: %ld\n",
+ n, sortMethod, spaceType, spaceUsed, groupsCount);
+ }
+ else
+ {
+ if (!opened_group)
+ {
+ ExplainOpenGroup("Workers", "Workers", false, es);
+ opened_group = true;
+ }
+ ExplainOpenGroup("Worker", NULL, true, es);
+ ExplainPropertyInteger("Worker Number", n, es);
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups", groupsCount, es);
+ ExplainCloseGroup("Worker", NULL, true, es);
+ }
+ }
+ if (opened_group)
+ ExplainCloseGroup("Workers", "Workers", false, es);
+ }
+}
+
/*
* Show information on hash buckets/batches.
*/
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
- nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
- nodeValuesscan.o \
+ nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+ nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..34e05330ea 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
ExecReScanSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecReScanIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecReScanGroup((GroupState *) node);
break;
@@ -525,8 +530,12 @@ ExecSupportsBackwardScan(Plan *node)
case T_CteScan:
case T_Material:
case T_Sort:
+ /* these don't evaluate tlist */
return true;
+ case T_IncrementalSort:
+ return false;
+
case T_LockRows:
case T_Limit:
return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f8b72ebab9..490d6dd76c 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
#include "executor/nodeForeignscan.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeSeqscan.h"
@@ -280,6 +281,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecSortEstimate((SortState *) planstate, e->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+ break;
default:
break;
@@ -493,6 +498,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+ break;
default:
break;
@@ -918,6 +927,10 @@ ExecParallelReInitializeDSM(PlanState *planstate,
case T_SortState:
/* these nodes have DSM state, but no reinitialization is required */
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortReInitializeDSM((IncrementalSortState *) planstate, pcxt);
+ break;
default:
break;
@@ -976,6 +989,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
case T_SortState:
ExecSortRetrieveInstrumentation((SortState *) planstate);
break;
+ case T_IncrementalSortState:
+ ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+ break;
case T_HashState:
ExecHashRetrieveInstrumentation((HashState *) planstate);
break;
@@ -1225,6 +1241,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecSortInitializeWorker((SortState *) planstate, pwcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+ pwcxt);
+ break;
default:
break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 43a27a9af2..bc92c3d0e7 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
estate, eflags);
break;
+ case T_IncrementalSort:
+ result = (PlanState *) ExecInitIncrementalSort(
+ (IncrementalSort *) node, estate, eflags);
+ break;
+
case T_Group:
result = (PlanState *) ExecInitGroup((Group *) node,
estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
ExecEndSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecEndIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecEndGroup((GroupState *) node);
break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 46ee880415..30855c3fe7 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -667,6 +667,7 @@ initialize_phase(AggState *aggstate, int newphase)
sortnode->collations,
sortnode->nullsFirst,
work_mem,
+ false,
false);
}
@@ -754,7 +755,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
- work_mem, false);
+ work_mem, false, false);
}
/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..1a1e48fb77
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,649 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ * Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ * Incremental sort is a specially optimized kind of multi-key sort used
+ * when the input is already presorted by a prefix of the required sort
+ * keys.  That is, when we need to sort by (key1, key2 ... keyN) and the
+ * input is already sorted by (key1, key2 ... keyM), M < N, we can sort
+ * individually the groups of tuples in which the values of
+ * (key1, key2 ... keyM) are equal.
+ *
+ * Consider the following example.  The input tuples consist of two
+ * integers (x, y) and are already presorted by x, while we need them
+ * sorted by both x and y.  Let the input tuples be the following.
+ *
+ * (1, 5)
+ * (1, 2)
+ * (2, 10)
+ * (2, 1)
+ * (2, 5)
+ * (3, 3)
+ * (3, 7)
+ *
+ * The incremental sort algorithm sorts the following groups, which have
+ * equal x, individually by y:
+ * (1, 5) (1, 2)
+ * (2, 10) (2, 1) (2, 5)
+ * (3, 3) (3, 7)
+ *
+ * After sorting these groups and concatenating them, we get the
+ * following tuple set, which is sorted by both x and y.
+ *
+ * (1, 2)
+ * (1, 5)
+ * (2, 1)
+ * (2, 5)
+ * (2, 10)
+ * (3, 3)
+ * (3, 7)
+ *
+ * Incremental sort is faster than a full sort on large datasets.  But it
+ * provides the biggest benefit for queries with LIMIT, because it can
+ * return the first tuples without reading the whole input dataset.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for skipKeys comparison.
+ */
+static void
+prepareSkipCols(IncrementalSortState *node)
+{
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ int skipCols,
+ i;
+
+ Assert(IsA(plannode, IncrementalSort));
+ skipCols = plannode->skipCols;
+
+ node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+ for (i = 0; i < skipCols; i++)
+ {
+ Oid equalityOp, equalityFunc;
+ SkipKeyData *key;
+
+ key = &node->skipKeys[i];
+ key->attno = plannode->sort.sortColIdx[i];
+
+ equalityOp = get_equality_op_for_ordering_op(
+ plannode->sort.sortOperators[i], NULL);
+ if (!OidIsValid(equalityOp))
+ elog(ERROR, "missing equality operator for ordering operator %u",
+ plannode->sort.sortOperators[i]);
+
+ equalityFunc = get_opcode(equalityOp);
+ if (!OidIsValid(equalityFunc))
+ elog(ERROR, "missing function for operator %u", equalityOp);
+
+ /* Lookup the comparison function */
+ fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+ /* We can initialize the callinfo just once and re-use it */
+ InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ plannode->sort.collations[i], NULL, NULL);
+ key->fcinfo.argnull[0] = false;
+ key->fcinfo.argnull[1] = false;
+ }
+}
+
+/*
+ * Check whether the values of the first "skipCols" sort columns are
+ * equal in tuples "a" and "b".
+ */
+static bool
+cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ TupleTableSlot *b)
+{
+ int n, i;
+
+ Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+ n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+ for (i = 0; i < n; i++)
+ {
+ Datum datumA, datumB, result;
+ bool isnullA, isnullB;
+ AttrNumber attno = node->skipKeys[i].attno;
+ SkipKeyData *key;
+
+ datumA = slot_getattr(a, attno, &isnullA);
+ datumB = slot_getattr(b, attno, &isnullB);
+
+ /* Special case for NULL-vs-NULL, else use standard comparison */
+ if (isnullA || isnullB)
+ {
+ if (isnullA == isnullB)
+ continue;
+ else
+ return false;
+ }
+
+ key = &node->skipKeys[i];
+
+ key->fcinfo.arg[0] = datumA;
+ key->fcinfo.arg[1] = datumB;
+
+ /* just for paranoia's sake, we reset isnull each time */
+ key->fcinfo.isnull = false;
+
+ result = FunctionCallInvoke(&key->fcinfo);
+
+ /* Check for null result, since caller is clearly not expecting one */
+ if (key->fcinfo.isnull)
+ elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+ if (!DatumGetBool(result))
+ return false;
+ }
+ return true;
+}
+
+/*
+ * Copying tuples into node->sampleSlot introduces some overhead.  It's
+ * especially noticeable when groups contain only one or a few tuples.  To
+ * cope with this problem, we don't copy a sample tuple until the group
+ * contains at least MIN_GROUP_SIZE tuples.  This might reduce the
+ * efficiency of incremental sort, but it reduces the probability of a
+ * performance regression.
+ */
+#define MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSort
+ *
+ * Assuming that the outer subtree returns tuples presorted by some
+ * prefix of the target sort columns, performs an incremental sort.
+ * It fetches groups of tuples whose prefix sort columns are equal and
+ * sorts each group using tuplesort.  This approach avoids sorting the
+ * whole dataset at once.  Besides taking less memory and being faster,
+ * it allows us to start returning tuples before fetching the full
+ * dataset from the outer subtree.
+ *
+ * Conditions:
+ * -- none.
+ *
+ * Initial States:
+ * -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+ IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+ EState *estate;
+ ScanDirection dir;
+ Tuplesortstate *tuplesortstate;
+ TupleTableSlot *slot;
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ PlanState *outerNode;
+ TupleDesc tupDesc;
+ int64 nTuples = 0;
+
+ /*
+ * get state info from node
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "entering routine");
+
+ estate = node->ss.ps.state;
+ dir = estate->es_direction;
+ tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+ /*
+ * Return next tuple from sorted set if any.
+ */
+ if (node->sort_Done)
+ {
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ if (tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL) || node->finished)
+ return slot;
+ }
+
+ /*
+ * If this is the first call, or the previous group is exhausted, read
+ * the next group of tuples from the outer plan and pass them to
+ * tuplesort.c.  Subsequent calls just fetch tuples from tuplesort.
+ */
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "sorting subplan");
+
+ /*
+ * Want to scan subplan in the forward direction while creating the
+ * sorted data.
+ */
+ estate->es_direction = ForwardScanDirection;
+
+ /*
+ * Initialize tuplesort module.
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "calling tuplesort_begin");
+
+ outerNode = outerPlanState(node);
+ tupDesc = ExecGetResultType(outerNode);
+
+ if (node->tuplesortstate == NULL)
+ {
+ /*
+ * We are going to process the first group of presorted data.
+ * Initialize the support structures for cmpSortSkipCols(), which
+ * compares the already-sorted prefix columns.
+ */
+ prepareSkipCols(node);
+
+ /*
+ * Pass all the columns to tuplesort.  We feed tuplesort groups of at
+ * least MIN_GROUP_SIZE tuples, so these groups don't necessarily have
+ * equal values of the first column.  We are unlikely to have huge
+ * groups with incremental sort, so using abbreviated keys would
+ * likely be a waste of time.
+ */
+ tuplesortstate = tuplesort_begin_heap(
+ tupDesc,
+ plannode->sort.numCols,
+ plannode->sort.sortColIdx,
+ plannode->sort.sortOperators,
+ plannode->sort.collations,
+ plannode->sort.nullsFirst,
+ work_mem,
+ false,
+ true);
+ node->tuplesortstate = (void *) tuplesortstate;
+ }
+ else
+ {
+ /* Next group of presorted data */
+ tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ }
+ node->groupsCount++;
+
+ /* Calculate remaining bound for bounded sort */
+ if (node->bounded)
+ tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+ /* Put the saved tuple, if any, into the tuplesort */
+ if (!TupIsNull(node->sampleSlot))
+ {
+ tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+ ExecClearTuple(node->sampleSlot);
+ nTuples++;
+ }
+
+ /*
+ * Feed the tuplesort with the next group of tuples, in which the
+ * skipCols sort column values are all equal.
+ */
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+
+ /* Put next group of presorted data to the tuplesort */
+ if (nTuples < MIN_GROUP_SIZE)
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+
+ /* Save last tuple in minimal group */
+ if (nTuples == MIN_GROUP_SIZE - 1)
+ ExecCopySlot(node->sampleSlot, slot);
+ nTuples++;
+ }
+ else
+ {
+ /* Iterate while skip cols are the same as in saved tuple */
+ bool cmp;
+ cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+
+ if (cmp)
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ nTuples++;
+ }
+ else
+ {
+ ExecCopySlot(node->sampleSlot, slot);
+ break;
+ }
+ }
+ }
+
+ /*
+ * Complete the sort.
+ */
+ tuplesort_performsort(tuplesortstate);
+
+ /*
+ * restore to user specified direction
+ */
+ estate->es_direction = dir;
+
+ /*
+ * finally set the sorted flag to true
+ */
+ node->sort_Done = true;
+ node->bounded_Done = node->bounded;
+ if (node->shared_info && node->am_worker)
+ {
+ TuplesortInstrumentation *si;
+
+ Assert(IsParallelWorker());
+ Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+ si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+ tuplesort_get_stats(tuplesortstate, si);
+ node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+ node->groupsCount;
+ }
+
+ /*
+ * Adjust bound_Done with number of tuples we've actually sorted.
+ */
+ if (node->bounded)
+ {
+ if (node->finished)
+ node->bound_Done = node->bound;
+ else
+ node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ }
+
+ SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "retrieving tuple from tuplesort");
+
+ /*
+ * Get the first or next tuple from tuplesort. Returns NULL if no more
+ * tuples.
+ */
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+ return slot;
+}
+
+/* ----------------------------------------------------------------
+ * ExecInitIncrementalSort
+ *
+ * Creates the run-time state information for the sort node
+ * produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+ IncrementalSortState *incrsortstate;
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "initializing sort node");
+
+ /*
+ * Incremental sort can't be used with EXEC_FLAG_REWIND,
+ * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only hold the
+ * current group of tuples in the tuplesortstate.
+ */
+ Assert((eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) == 0);
+
+ /*
+ * create state structure
+ */
+ incrsortstate = makeNode(IncrementalSortState);
+ incrsortstate->ss.ps.plan = (Plan *) node;
+ incrsortstate->ss.ps.state = estate;
+ incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+ incrsortstate->bounded = false;
+ incrsortstate->sort_Done = false;
+ incrsortstate->finished = false;
+ incrsortstate->tuplesortstate = NULL;
+ incrsortstate->sampleSlot = NULL;
+ incrsortstate->bound_Done = 0;
+ incrsortstate->groupsCount = 0;
+ incrsortstate->skipKeys = NULL;
+
+ /*
+ * Miscellaneous initialization
+ *
+ * Sort nodes don't initialize their ExprContexts because they never call
+ * ExecQual or ExecProject.
+ */
+
+ /*
+ * tuple table initialization
+ *
+ * sort nodes only return scan tuples from their sorted relation.
+ */
+ ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+
+ /*
+ * initialize child nodes
+ *
+ * We shield the child node from the need to support REWIND, BACKWARD, or
+ * MARK/RESTORE.
+ */
+ eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+ outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+ /*
+ * initialize tuple type. no need to initialize projection info because
+ * this node doesn't do projections.
+ */
+ ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+ /* make standalone slot to store previous tuple from outer node */
+ incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+ ExecGetResultType(outerPlanState(incrsortstate)));
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "sort node initialized");
+
+ return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ * ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "shutting down sort node");
+
+ /*
+ * clean out the tuple table
+ */
+ ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ /* must drop standalone tuple slot from outer node */
+ ExecDropSingleTupleTableSlot(node->sampleSlot);
+
+ /*
+ * Release tuplesort resources
+ */
+ if (node->tuplesortstate != NULL)
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+
+ /*
+ * shut down the subplan
+ */
+ ExecEndNode(outerPlanState(node));
+
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+ PlanState *outerPlan = outerPlanState(node);
+
+ /*
+ * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ * re-scan it at all.
+ */
+ if (!node->sort_Done)
+ return;
+
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+ /*
+ * Incremental sort doesn't support random access to its sorted output,
+ * so we always forget previous sort results, re-read the subplan and
+ * re-sort.
+ */
+ node->sort_Done = false;
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+ node->bound_Done = 0;
+
+ /*
+ * if chgParam of subnode is not null then plan will be re-scanned by
+ * first ExecProcNode.
+ */
+ if (outerPlan->chgParam == NULL)
+ ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ * Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortEstimate
+ *
+ * Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+ size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+ shm_toc_estimate_chunk(&pcxt->estimator, size);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortInitializeDSM
+ *
+ * Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + pcxt->nworkers * sizeof(IncrementalSortInfo);
+ node->shared_info = shm_toc_allocate(pcxt->toc, size);
+ /* ensure any unfilled slots will contain zeroes */
+ memset(node->shared_info, 0, size);
+ node->shared_info->num_workers = pcxt->nworkers;
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+ node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+ /* If there's any instrumentation space, clear it for next time */
+ if (node->shared_info != NULL)
+ {
+ memset(node->shared_info->sinfo, 0,
+ node->shared_info->num_workers * sizeof(IncrementalSortInfo));
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortInitializeWorker
+ *
+ * Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+ node->shared_info =
+ shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+ node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortRetrieveInstrumentation
+ *
+ * Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+ Size size;
+ SharedIncrementalSortInfo *si;
+
+ if (node->shared_info == NULL)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+ si = palloc(size);
+ memcpy(si, node->shared_info, size);
+ node->shared_info = si;
+}
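(Reviewer aside, not part of the patch.) The executor changes are easier to follow with the core idea in miniature: when tuples already arrive sorted by a prefix of the requested keys, split them into groups with equal prefix values and sort only each group on the remaining keys. A minimal standalone sketch with hypothetical `DemoTuple` names, illustrating the skipCols = 1 case:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical two-column tuple; the input arrives sorted by 'a' only. */
typedef struct
{
	int			a;
	int			b;
} DemoTuple;

static int
cmp_b(const void *x, const void *y)
{
	return ((const DemoTuple *) x)->b - ((const DemoTuple *) y)->b;
}

/*
 * Sort 'tuples' on (a, b), given that they are already sorted on 'a':
 * detect each run of equal 'a' values (a "sort group") and sort only that
 * run on the remaining key 'b'.
 */
static void
demo_incremental_sort(DemoTuple *tuples, size_t n)
{
	size_t		group_start = 0;
	size_t		i;

	for (i = 1; i <= n; i++)
	{
		if (i == n || tuples[i].a != tuples[group_start].a)
		{
			qsort(tuples + group_start, i - group_start,
				  sizeof(DemoTuple), cmp_b);
			group_start = i;
		}
	}
}
```

Each group can be emitted as soon as it is sorted, which is what gives the plan node its low startup cost.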
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 9c68de8565..90c82af17f 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
plannode->collations,
plannode->nullsFirst,
work_mem,
- node->randomAccess);
+ node->randomAccess,
+ false);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index ddbbc79823..94d5ba0e41 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -919,6 +919,24 @@ _copyMaterial(const Material *from)
}
+/*
+ * CopySortFields
+ *
+ * This function copies the fields of the Sort node. It is used by
+ * all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+ CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+ COPY_SCALAR_FIELD(numCols);
+ COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
/*
* _copySort
*/
@@ -930,13 +948,29 @@ _copySort(const Sort *from)
/*
* copy node superclass fields
*/
- CopyPlanFields((const Plan *) from, (Plan *) newnode);
+ CopySortFields(from, newnode);
- COPY_SCALAR_FIELD(numCols);
- COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
- COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
- COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
- COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+ IncrementalSort *newnode = makeNode(IncrementalSort);
+
+ /*
+ * copy node superclass fields
+ */
+ CopySortFields((const Sort *) from, (Sort *) newnode);
+
+ /*
+ * copy remainder of node
+ */
+ COPY_SCALAR_FIELD(skipCols);
return newnode;
}
@@ -4817,6 +4851,9 @@ copyObjectImpl(const void *from)
case T_Sort:
retval = _copySort(from);
break;
+ case T_IncrementalSort:
+ retval = _copyIncrementalSort(from);
+ break;
case T_Group:
retval = _copyGroup(from);
break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 5e72df137e..415a9e9b19 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -870,12 +870,10 @@ _outMaterial(StringInfo str, const Material *node)
}
static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
{
int i;
- WRITE_NODE_TYPE("SORT");
-
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
@@ -897,6 +895,24 @@ _outSort(StringInfo str, const Sort *node)
appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
}
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+ WRITE_NODE_TYPE("SORT");
+
+ _outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+ WRITE_NODE_TYPE("INCREMENTALSORT");
+
+ _outSortInfo(str, (const Sort *) node);
+
+ WRITE_INT_FIELD(skipCols);
+}
+
static void
_outUnique(StringInfo str, const Unique *node)
{
@@ -3739,6 +3755,9 @@ outNode(StringInfo str, const void *obj)
case T_Sort:
_outSort(str, obj);
break;
+ case T_IncrementalSort:
+ _outIncrementalSort(str, obj);
+ break;
case T_Unique:
_outUnique(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 9925866b53..99d6938ddc 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2060,12 +2060,13 @@ _readMaterial(void)
}
/*
- * _readSort
+ * ReadCommonSort
+ * Assign the basic stuff of all nodes that inherit from Sort
*/
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
{
- READ_LOCALS(Sort);
+ READ_TEMP_LOCALS();
ReadCommonPlan(&local_node->plan);
@@ -2074,6 +2075,32 @@ _readSort(void)
READ_OID_ARRAY(sortOperators, local_node->numCols);
READ_OID_ARRAY(collations, local_node->numCols);
READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+ READ_LOCALS_NO_FIELDS(Sort);
+
+ ReadCommonSort(local_node);
+
+ READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+ READ_LOCALS(IncrementalSort);
+
+ ReadCommonSort(&local_node->sort);
+
+ READ_INT_FIELD(skipCols);
READ_DONE();
}
@@ -2636,6 +2663,8 @@ parseNodeString(void)
return_value = _readMaterial();
else if (MATCH("SORT", 4))
return_value = _readSort();
+ else if (MATCH("INCREMENTALSORT", 15))
+ return_value = _readIncrementalSort();
else if (MATCH("GROUP", 5))
return_value = _readGroup();
else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 12a6ee4a22..e96c5fe137 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3613,6 +3613,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
ptype = "Sort";
subpath = ((SortPath *) path)->subpath;
break;
+ case T_IncrementalSortPath:
+ ptype = "IncrementalSort";
+ subpath = ((SortPath *) path)->subpath;
+ break;
case T_GroupPath:
ptype = "Group";
subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8679b14b29..05f58fff79 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -121,6 +121,7 @@ bool enable_indexonlyscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
+bool enable_incrementalsort = true;
bool enable_hashagg = true;
bool enable_nestloop = true;
bool enable_material = true;
@@ -1605,6 +1606,13 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
+ * The sort may be either a full sort of the relation, or an incremental sort
+ * when the data is already presorted by a prefix of the required pathkeys.
+ * In the latter case we estimate the number of groups into which the
+ * presorted pathkeys divide the input, then estimate the cost of sorting
+ * each individual group, assuming the input is divided among the groups
+ * uniformly. Also, if a LIMIT is specified, we only have to pull from the
+ * source and sort a fraction of the total groups.
+ *
* If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
@@ -1631,7 +1639,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is already sorted
+ * 'input_startup_cost' is the startup cost for reading the input data
+ * 'input_total_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
@@ -1647,19 +1657,28 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
*/
void
cost_sort(Path *path, PlannerInfo *root,
- List *pathkeys, Cost input_cost, double tuples, int width,
- Cost comparison_cost, int sort_mem,
+ List *pathkeys, int presorted_keys,
+ Cost input_startup_cost, Cost input_total_cost,
+ double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples)
{
- Cost startup_cost = input_cost;
- Cost run_cost = 0;
+ Cost startup_cost = input_startup_cost;
+ Cost run_cost = 0,
+ rest_cost,
+ group_cost,
+ input_run_cost = input_total_cost - input_startup_cost;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
+ double num_groups,
+ group_input_bytes,
+ group_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
+ if (!enable_incrementalsort)
+ presorted_keys = 0;
path->rows = tuples;
@@ -1685,13 +1704,50 @@ cost_sort(Path *path, PlannerInfo *root,
output_bytes = input_bytes;
}
- if (output_bytes > sort_mem_bytes)
+ /*
+ * Estimate the number of groups into which the presorted keys divide
+ * the dataset.
+ */
+ if (presorted_keys > 0)
+ {
+ List *presortedExprs = NIL;
+ ListCell *l;
+ int i = 0;
+
+ /* Extract presorted keys as list of expressions */
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ presortedExprs = lappend(presortedExprs, member->em_expr);
+
+ i++;
+ if (i >= presorted_keys)
+ break;
+ }
+
+ /* Estimate number of groups with equal presorted keys */
+ num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
+ }
+ else
+ {
+ num_groups = 1.0;
+ }
+
+ /*
+ * Estimate the average cost of sorting one group of tuples whose
+ * presorted keys are equal.
+ group_input_bytes = input_bytes / num_groups;
+ group_tuples = tuples / num_groups;
+ if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
- double npages = ceil(input_bytes / BLCKSZ);
- double nruns = input_bytes / sort_mem_bytes;
+ double npages = ceil(group_input_bytes / BLCKSZ);
+ double nruns = group_input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
@@ -1701,7 +1757,7 @@ cost_sort(Path *path, PlannerInfo *root,
*
* Assume about N log2 N comparisons
*/
- startup_cost += comparison_cost * tuples * LOG2(tuples);
+ group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
/* Disk costs */
@@ -1712,10 +1768,10 @@ cost_sort(Path *path, PlannerInfo *root,
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
- startup_cost += npageaccesses *
+ group_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
- else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
+ else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
@@ -1723,14 +1779,33 @@ cost_sort(Path *path, PlannerInfo *root,
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
- startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+ group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
}
else
{
- /* We'll use plain quicksort on all the input tuples */
- startup_cost += comparison_cost * tuples * LOG2(tuples);
+ /*
+ * We'll use plain quicksort on all the input tuples. If we expect
+ * fewer than two tuples per sort group, take the logarithmic part of
+ * the estimate to be 1.
+ */
+ if (group_tuples >= 2.0)
+ group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
+ else
+ group_cost = comparison_cost * group_tuples;
}
+ /* Add per group cost of fetching tuples from input */
+ group_cost += input_run_cost / num_groups;
+
+ /*
+ * We have to sort the first group before the node can emit anything;
+ * sorting the remaining groups is needed to return all the other
+ * tuples.
+ */
+ startup_cost += group_cost;
+ rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ if (rest_cost > 0.0)
+ run_cost += rest_cost;
+
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
@@ -1741,6 +1816,19 @@ cost_sort(Path *path, PlannerInfo *root,
*/
run_cost += cpu_operator_cost * tuples;
+ /* Extra costs of incremental sort */
+ if (presorted_keys > 0)
+ {
+ /*
+ * In the incremental sort case we also have to charge for detecting
+ * sort group boundaries, which costs an extra copy and comparison per
+ * tuple.
+ */
+ run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+ /* Cost of per group tuplesort reset */
+ run_cost += 2.0 * cpu_tuple_cost * num_groups;
+ }
+
path->startup_cost = startup_cost;
path->total_cost = startup_cost + run_cost;
}
@@ -2717,6 +2805,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
cost_sort(&sort_path,
root,
outersortkeys,
+ pathkeys_common(outer_path->pathkeys, outersortkeys),
+ outer_path->startup_cost,
outer_path->total_cost,
outer_path_rows,
outer_path->pathtarget->width,
@@ -2743,6 +2833,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
cost_sort(&sort_path,
root,
innersortkeys,
+ pathkeys_common(inner_path->pathkeys, innersortkeys),
+ inner_path->startup_cost,
inner_path->total_cost,
inner_path_rows,
inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index ef58cff28d..329ba7b532 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -22,10 +22,12 @@
#include "nodes/nodeFuncs.h"
#include "nodes/plannodes.h"
#include "optimizer/clauses.h"
+#include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/tlist.h"
#include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -308,6 +310,33 @@ compare_pathkeys(List *keys1, List *keys2)
return PATHKEYS_EQUAL;
}
+
+/*
+ * pathkeys_common
+ * Return the length of the longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+ int n = 0;
+ ListCell *key1,
+ *key2;
+
+ forboth(key1, keys1, key2, keys2)
+ {
+ PathKey *pathkey1 = (PathKey *) lfirst(key1);
+ PathKey *pathkey2 = (PathKey *) lfirst(key2);
+
+ if (pathkey1 != pathkey2)
+ return n;
+ n++;
+ }
+
+ return n;
+}
+
+
/*
* pathkeys_contained_in
* Common special case of compare_pathkeys: we just want to know
@@ -1488,26 +1517,42 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Returns the number of pathkeys that match the given query pathkeys. The
+ * remaining keys can be satisfied by incremental sort.
*/
-static int
-pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
+int
+pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
{
- if (root->query_pathkeys == NIL)
+ int n_common_pathkeys;
+
+ if (query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
- if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
+ n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
+
+ if (enable_incrementalsort)
{
- /* It's useful ... or at least the first N keys are */
- return list_length(root->query_pathkeys);
+ /*
+ * Return the number of common pathkeys, which may be zero. Any leading
+ * common pathkeys are useful for ordering, since incremental sort can
+ * handle the remaining keys.
+ */
+ return n_common_pathkeys;
+ }
+ else
+ {
+ /*
+ * When incremental sort is disabled, pathkeys are useful only if they
+ * contain all the query pathkeys.
+ */
+ if (n_common_pathkeys == list_length(query_pathkeys))
+ return n_common_pathkeys;
+ else
+ return 0;
}
-
- return 0; /* path ordering not useful */
}
/*
@@ -1523,7 +1568,7 @@ truncate_useless_pathkeys(PlannerInfo *root,
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
- nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
+ nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
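(Reviewer aside, not part of the patch.) pathkeys_common() is just a longest-common-prefix computation over pointer-equal PathKey lists; callers then compare the result against list_length() to choose between no sort, incremental sort, and full sort. The same contract on plain arrays, as a sanity sketch with hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Length of the longest common prefix of two pointer arrays. This mirrors
 * the contract of pathkeys_common(): PathKeys are canonical objects, so
 * plain pointer equality is the right comparison.
 */
static size_t
common_prefix(void *const *a, size_t na, void *const *b, size_t nb)
{
	size_t		n = 0;

	while (n < na && n < nb && a[n] == b[n])
		n++;
	return n;
}
```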
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e599283d6b..133435f516 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -236,7 +236,7 @@ static MergeJoin *make_mergejoin(List *tlist,
Plan *lefttree, Plan *righttree,
JoinType jointype, bool inner_unique,
bool skip_mark_restore);
-static Sort *make_sort(Plan *lefttree, int numCols,
+static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
@@ -252,10 +252,11 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
- Relids relids);
+ Relids relids, int skipCols);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
- Plan *lefttree);
+ Plan *lefttree,
+ int skipCols);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
@@ -437,6 +438,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
(GatherPath *) best_path);
break;
case T_Sort:
+ case T_IncrementalSort:
plan = (Plan *) create_sort_plan(root,
(SortPath *) best_path,
flags);
@@ -1122,6 +1124,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int n_common_pathkeys;
/* Build the child plan */
/* Must insist that all children return the same tlist */
@@ -1156,9 +1159,11 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
- if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+ if (n_common_pathkeys < list_length(pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
+ n_common_pathkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
@@ -1508,6 +1513,7 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
Plan *subplan;
List *pathkeys = best_path->path.pathkeys;
List *tlist = build_path_tlist(root, &best_path->path);
+ int n_common_pathkeys;
/* As with Gather, it's best to project away columns in the workers. */
subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
@@ -1537,12 +1543,16 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
- if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
+ n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
+ if (n_common_pathkeys < list_length(pathkeys))
+ {
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ n_common_pathkeys,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
+ }
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
@@ -1655,6 +1665,7 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
{
Sort *plan;
Plan *subplan;
+ int n_common_pathkeys;
/*
* We don't want any excess columns in the sorted tuples, so request a
@@ -1664,7 +1675,13 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
- plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+ if (IsA(best_path, IncrementalSortPath))
+ n_common_pathkeys = ((IncrementalSortPath *) best_path)->skipCols;
+ else
+ n_common_pathkeys = 0;
+
+ plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+ NULL, n_common_pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -1908,7 +1925,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
- subplan);
+ subplan,
+ 0);
}
if (!rollup->is_hashed)
@@ -3848,10 +3866,15 @@ create_mergejoin_plan(PlannerInfo *root,
*/
if (best_path->outersortkeys)
{
+ Sort *sort;
+ int n_common_pathkeys;
Relids outer_relids = outer_path->parent->relids;
- Sort *sort = make_sort_from_pathkeys(outer_plan,
- best_path->outersortkeys,
- outer_relids);
+
+ n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
+ best_path->jpath.outerjoinpath->pathkeys);
+
+ sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
+ outer_relids, n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
@@ -3862,10 +3885,15 @@ create_mergejoin_plan(PlannerInfo *root,
if (best_path->innersortkeys)
{
+ Sort *sort;
+ int n_common_pathkeys;
Relids inner_relids = inner_path->parent->relids;
- Sort *sort = make_sort_from_pathkeys(inner_plan,
- best_path->innersortkeys,
- inner_relids);
+
+ n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
+ best_path->jpath.innerjoinpath->pathkeys);
+
+ sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
+ inner_relids, n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
@@ -4927,8 +4955,13 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
{
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
+ int skip_cols = 0;
+
+ if (IsA(plan, IncrementalSort))
+ skip_cols = ((IncrementalSort *) plan)->skipCols;
- cost_sort(&sort_path, root, NIL,
+ cost_sort(&sort_path, root, NIL, skip_cols,
+ lefttree->startup_cost,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
@@ -5519,13 +5552,31 @@ make_mergejoin(List *tlist,
* nullsFirst arrays already.
*/
static Sort *
-make_sort(Plan *lefttree, int numCols,
+make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
- Sort *node = makeNode(Sort);
- Plan *plan = &node->plan;
+ Sort *node;
+ Plan *plan;
+
+ /* Always use regular sort node when enable_incrementalsort = false */
+ if (!enable_incrementalsort)
+ skipCols = 0;
+
+ if (skipCols == 0)
+ {
+ node = makeNode(Sort);
+ }
+ else
+ {
+ IncrementalSort *incrementalSort;
+
+ incrementalSort = makeNode(IncrementalSort);
+ node = &incrementalSort->sort;
+ incrementalSort->skipCols = skipCols;
+ }
+ plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
@@ -5858,9 +5909,11 @@ find_ec_member_for_tle(EquivalenceClass *ec,
* 'lefttree' is the node which yields input tuples
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
* 'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ * 'skipCols' is the number of presorted columns in input tuples
*/
static Sort *
-make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
+make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
+ Relids relids, int skipCols)
{
int numsortkeys;
AttrNumber *sortColIdx;
@@ -5880,7 +5933,7 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
&nullsFirst);
/* Now build the Sort node */
- return make_sort(lefttree, numsortkeys,
+ return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
@@ -5923,7 +5976,7 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
numsortkeys++;
}
- return make_sort(lefttree, numsortkeys,
+ return make_sort(lefttree, numsortkeys, 0,
sortColIdx, sortOperators,
collations, nullsFirst);
}
@@ -5944,7 +5997,8 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
- Plan *lefttree)
+ Plan *lefttree,
+ int skipCols)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
@@ -5977,7 +6031,7 @@ make_sort_from_groupcols(List *groupcls,
numsortkeys++;
}
- return make_sort(lefttree, numsortkeys,
+ return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
@@ -6633,6 +6687,7 @@ is_projection_capable_plan(Plan *plan)
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 95cbffbd69..308f60beac 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -44,6 +44,7 @@
#include "parser/parse_clause.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
#include "utils/syscache.h"
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7b52dadd81..3842271245 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3849,14 +3849,14 @@ create_grouping_paths(PlannerInfo *root,
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
- bool is_sorted;
+ int n_useful_pathkeys;
- is_sorted = pathkeys_contained_in(root->group_pathkeys,
- path->pathkeys);
- if (path == cheapest_partial_path || is_sorted)
+ n_useful_pathkeys = pathkeys_useful_for_ordering(
+ root->group_pathkeys, path->pathkeys);
+ if (path == cheapest_partial_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest partial path, if it isn't already */
- if (!is_sorted)
+ if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
@@ -3929,14 +3929,14 @@ create_grouping_paths(PlannerInfo *root,
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
- bool is_sorted;
+ int n_useful_pathkeys;
- is_sorted = pathkeys_contained_in(root->group_pathkeys,
- path->pathkeys);
- if (path == cheapest_path || is_sorted)
+ n_useful_pathkeys = pathkeys_useful_for_ordering(
+ root->group_pathkeys, path->pathkeys);
+ if (path == cheapest_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest-total path if it isn't already sorted */
- if (!is_sorted)
+ if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
@@ -5003,13 +5003,13 @@ create_ordered_paths(PlannerInfo *root,
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
- bool is_sorted;
+ int n_useful_pathkeys;
- is_sorted = pathkeys_contained_in(root->sort_pathkeys,
- path->pathkeys);
- if (path == cheapest_input_path || is_sorted)
+ n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
+ path->pathkeys);
+ if (path == cheapest_input_path || n_useful_pathkeys > 0)
{
- if (!is_sorted)
+ if (n_useful_pathkeys < list_length(root->sort_pathkeys))
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
@@ -6139,8 +6139,9 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
- cost_sort(&seqScanAndSortPath, root, NIL,
- seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
+ cost_sort(&seqScanAndSortPath, root, NIL, 0,
+ seqScanPath->startup_cost, seqScanPath->total_cost,
+ rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 4617d12cb9..be520e6086 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 46367cba63..616ad1a474 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2782,6 +2782,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 5a08e75ad5..eb95ca4c5e 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -983,7 +983,8 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
- cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
+ cost_sort(&sorted_p, root, NIL, 0,
+ sorted_p.startup_cost, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 7df8761710..9c6f910f14 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -105,7 +105,7 @@ compare_path_costs(Path *path1, Path *path2, CostSelector criterion)
}
/*
- * compare_path_fractional_costs
+ * compare_fractional_path_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
@@ -1356,12 +1356,13 @@ create_merge_append_path(PlannerInfo *root,
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ int n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
- if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
@@ -1375,6 +1376,8 @@ create_merge_append_path(PlannerInfo *root,
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->parent->tuples,
subpath->pathtarget->width,
@@ -1622,7 +1625,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
/*
* Estimate cost for sort+unique implementation
*/
- cost_sort(&sort_path, root, NIL,
+ cost_sort(&sort_path, root, NIL, 0,
+ subpath->startup_cost,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
@@ -1715,6 +1719,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
GatherMergePath *pathnode = makeNode(GatherMergePath);
Cost input_startup_cost = 0;
Cost input_total_cost = 0;
+ int n_common_pathkeys;
Assert(subpath->parallel_safe);
Assert(pathkeys);
@@ -1731,7 +1736,9 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
- if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+
+ if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
@@ -1745,6 +1752,8 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
@@ -2604,9 +2613,31 @@ create_sort_path(PlannerInfo *root,
List *pathkeys,
double limit_tuples)
{
- SortPath *pathnode = makeNode(SortPath);
+ SortPath *pathnode;
+ int n_common_pathkeys;
+
+ if (enable_incrementalsort)
+ n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+ else
+ n_common_pathkeys = 0;
+
+ if (n_common_pathkeys == 0)
+ {
+ pathnode = makeNode(SortPath);
+ pathnode->path.pathtype = T_Sort;
+ }
+ else
+ {
+ IncrementalSortPath *incpathnode;
+
+ incpathnode = makeNode(IncrementalSortPath);
+ pathnode = &incpathnode->spath;
+ pathnode->path.pathtype = T_IncrementalSort;
+ incpathnode->skipCols = n_common_pathkeys;
+ }
+
+ Assert(n_common_pathkeys < list_length(pathkeys));
- pathnode->path.pathtype = T_Sort;
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
@@ -2620,7 +2651,9 @@ create_sort_path(PlannerInfo *root,
pathnode->subpath = subpath;
- cost_sort(&pathnode->path, root, pathkeys,
+ cost_sort(&pathnode->path, root,
+ pathkeys, n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
@@ -2932,7 +2965,8 @@ create_groupingsets_path(PlannerInfo *root,
else
{
/* Account for cost of sort, but don't charge input cost again */
- cost_sort(&sort_path, root, NIL,
+ cost_sort(&sort_path, root, NIL, 0,
+ 0.0,
0.0,
subpath->rows,
subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 79dbfd1a05..e3e984b3da 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -291,7 +291,8 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
qstate->sortCollations,
qstate->sortNullsFirsts,
work_mem,
- qstate->rescan_needed);
+ qstate->rescan_needed,
+ false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fcc8323f62..4726bee850 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3714,6 +3714,42 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
return numdistinct;
}
+/*
+ * estimate_pathkeys_groups - Estimate the number of groups the dataset is
+ * divided into by pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * the first i pathkeys divide the dataset into.  Actually it is a convenience
+ * wrapper over estimate_num_groups().
+ */
+double *
+estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+{
+ ListCell *l;
+ List *groupExprs = NIL;
+ double *result;
+ int i;
+
+ /*
+ * Get number of groups for each prefix of pathkeys.
+ */
+ i = 0;
+ result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ groupExprs = lappend(groupExprs, member->em_expr);
+
+ result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ i++;
+ }
+
+ return result;
+}
+
/*
* Estimate hash bucket statistics when the specified expression is used
* as a hash key for the given number of buckets.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 72f6be329e..bea4f00421 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -857,6 +857,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of incremental sort steps."),
+ NULL
+ },
+ &enable_incrementalsort,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index eecc66cafa..80bc67c093 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -231,6 +231,13 @@ struct Tuplesortstate
int64 allowedMem; /* total memory allowed, in bytes */
int maxTapes; /* number of tapes (Knuth's T) */
int tapeRange; /* maxTapes-1 (Knuth's P) */
+ int64 maxSpace; /* maximum amount of space occupied among sort
+ of groups, either in-memory or on-disk */
+ bool maxSpaceOnDisk; /* true when maxSpace is value for on-disk
+ space, fase when it's value for in-memory
+ space */
+ TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ MemoryContext maincontext;
MemoryContext sortcontext; /* memory context holding most sort data */
MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
@@ -573,6 +580,9 @@ static void writetup_datum(Tuplesortstate *state, int tapenum,
static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state, bool delete);
+static void tuplesort_updatemax(Tuplesortstate *state);
+
/*
* Special versions of qsort just for SortTuple objects. qsort_tuple() sorts
@@ -607,18 +617,27 @@ static Tuplesortstate *
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
+ MemoryContext maincontext;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
- * Create a working memory context for this sort operation. All data
- * needed by the sort will live inside this context.
+ * Memory context surviving tuplesort_reset. This memory context holds
+ * data which is useful to keep while sorting multiple similar batches.
*/
- sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+ maincontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
+ /*
+ * Create a working memory context for one sort operation. The content of
+ * this context is deleted by tuplesort_reset.
+ */
+ sortcontext = AllocSetContextCreate(maincontext,
+ "TupleSort sort",
+ ALLOCSET_DEFAULT_SIZES);
+
/*
* Caller tuple (e.g. IndexTuple) memory context.
*
@@ -636,7 +655,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
- oldcontext = MemoryContextSwitchTo(sortcontext);
+ oldcontext = MemoryContextSwitchTo(maincontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
@@ -654,6 +673,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
state->availMem = state->allowedMem;
state->sortcontext = sortcontext;
state->tuplecontext = tuplecontext;
+ state->maincontext = maincontext;
state->tapeset = NULL;
state->memtupcount = 0;
@@ -694,13 +714,14 @@ tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
- int workMem, bool randomAccess)
+ int workMem, bool randomAccess,
+ bool skipAbbrev)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
AssertArg(nkeys > 0);
@@ -742,7 +763,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
- sortKey->abbreviate = (i == 0);
+ sortKey->abbreviate = (i == 0) && !skipAbbrev;
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
@@ -773,7 +794,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -864,7 +885,7 @@ tuplesort_begin_index_btree(Relation heapRel,
MemoryContext oldcontext;
int i;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -939,7 +960,7 @@ tuplesort_begin_index_hash(Relation heapRel,
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -981,7 +1002,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
int16 typlen;
bool typbyval;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -1092,16 +1113,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
}
/*
- * tuplesort_end
- *
- * Release resources and clean up.
+ * tuplesort_free
*
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage. Be careful not to attempt to use or free such
- * pointers afterwards!
+ * Internal routine for freeing resources of tuplesort.
*/
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state, bool delete)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1160,7 +1177,98 @@ tuplesort_end(Tuplesortstate *state)
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
- MemoryContextDelete(state->sortcontext);
+ if (delete)
+ {
+ MemoryContextDelete(state->maincontext);
+ }
+ else
+ {
+ MemoryContextResetOnly(state->sortcontext);
+ MemoryContextResetOnly(state->tuplecontext);
+ }
+}
+
+/*
+ * tuplesort_end
+ *
+ * Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage. Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+ tuplesort_free(state, true);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ * Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+ int64 spaceUsed;
+ bool spaceUsedOnDisk;
+
+ /*
+ * Note: it might seem we should provide both memory and disk usage for a
+ * disk-based sort. However, the current code doesn't track memory space
+ * accurately once we have begun to return tuples to the caller (since we
+ * don't account for pfree's the caller is expected to do), so we cannot
+ * rely on availMem in a disk sort. This does not seem worth the overhead
+ * to fix. Is it worth creating an API for the memory context code to
+ * tell us how much is actually used in sortcontext?
+ */
+ if (state->tapeset)
+ {
+ spaceUsedOnDisk = true;
+ spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+ }
+ else
+ {
+ spaceUsedOnDisk = false;
+ spaceUsed = state->allowedMem - state->availMem;
+ }
+
+ if (spaceUsed > state->maxSpace)
+ {
+ state->maxSpace = spaceUsed;
+ state->maxSpaceOnDisk = spaceUsedOnDisk;
+ state->maxSpaceStatus = state->status;
+ }
+}
+
+/*
+ * tuplesort_reset
+ *
+ * Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ * meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ * start a new sort.  This allows us to avoid recreating the tuplesort (and
+ * to save resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+ tuplesort_updatemax(state);
+ tuplesort_free(state, false);
+ state->status = TSS_INITIAL;
+ state->memtupcount = 0;
+ state->boundUsed = false;
+ state->tapeset = NULL;
+ state->currentRun = 0;
+ state->result_tape = -1;
+ state->bounded = false;
+ state->availMem = state->allowedMem;
+ state->lastReturnedTuple = NULL;
+ state->slabAllocatorUsed = false;
+ state->slabMemoryBegin = NULL;
+ state->slabMemoryEnd = NULL;
+ state->slabFreeHead = NULL;
+ USEMEM(state, GetMemoryChunkSpace(state->memtuples));
}
/*
@@ -2944,18 +3052,15 @@ tuplesort_get_stats(Tuplesortstate *state,
* to fix. Is it worth creating an API for the memory context code to
* tell us how much is actually used in sortcontext?
*/
- if (state->tapeset)
- {
+ tuplesort_updatemax(state);
+
+ if (state->maxSpaceOnDisk)
stats->spaceType = SORT_SPACE_TYPE_DISK;
- stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- }
else
- {
stats->spaceType = SORT_SPACE_TYPE_MEMORY;
- stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
- }
+ stats->spaceUsed = (state->maxSpace + 1023) / 1024;
- switch (state->status)
+ switch (state->maxSpaceStatus)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..b2e4e5061f
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 2a4f7407a1..4180f57e88 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1754,6 +1754,20 @@ typedef struct MaterialState
Tuplestorestate *tuplestorestate;
} MaterialState;
+
+/* ----------------
+ * When sorting by multiple keys, the input dataset may already be
+ * presorted by some prefix of these keys.  We call them "skip keys".
+ * SkipKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct SkipKeyData
+{
+ FmgrInfo flinfo; /* comparison function info */
+ FunctionCallInfoData fcinfo; /* comparison function call info */
+ OffsetNumber attno; /* attribute number in tuple */
+} SkipKeyData;
+
/* ----------------
* Shared memory container for per-worker sort information
* ----------------
@@ -1782,6 +1796,44 @@ typedef struct SortState
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
+/* ----------------
+ * Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+ TuplesortInstrumentation sinstrument;
+ int64 groupsCount;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+ int num_workers;
+ IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ * IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+ ScanState ss; /* its first field is NodeTag */
+ bool bounded; /* is the result set bounded? */
+ int64 bound; /* if bounded, how many tuples are needed */
+ bool sort_Done; /* sort completed yet? */
+ bool finished; /* fetching tuples from outer node
+ is finished? */
+ bool bounded_Done; /* value of bounded we did the sort with */
+ int64 bound_Done; /* value of bound we did the sort with */
+ void *tuplesortstate; /* private state of tuplesort.c */
+ SkipKeyData *skipKeys; /* keys the dataset is presorted by */
+ int64 groupsCount; /* number of groups with equal skip keys */
+ TupleTableSlot *sampleSlot; /* slot for sample tuple of sort group */
+ bool am_worker; /* are we a worker? */
+ SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
/* ---------------------
* GroupState information
* ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 2eb3d6d371..b6a9d6c597 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -73,6 +73,7 @@ typedef enum NodeTag
T_HashJoin,
T_Material,
T_Sort,
+ T_IncrementalSort,
T_Group,
T_Agg,
T_WindowAgg,
@@ -125,6 +126,7 @@ typedef enum NodeTag
T_HashJoinState,
T_MaterialState,
T_SortState,
+ T_IncrementalSortState,
T_GroupState,
T_AggState,
T_WindowAggState,
@@ -240,6 +242,7 @@ typedef enum NodeTag
T_ProjectionPath,
T_ProjectSetPath,
T_SortPath,
+ T_IncrementalSortPath,
T_GroupPath,
T_UpperUniquePath,
T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 74e9fb5f7b..033ec416fe 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -750,6 +750,17 @@ typedef struct Sort
bool *nullsFirst; /* NULLS FIRST/LAST directions */
} Sort;
+
+/* ----------------
+ * incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+ Sort sort;
+ int skipCols; /* number of presorted columns */
+} IncrementalSort;
+
/* ---------------
* group node -
* Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 71689b8ed6..0d072fd7c3 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1513,6 +1513,16 @@ typedef struct SortPath
Path *subpath; /* path representing input source */
} SortPath;
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+ SortPath spath;
+ int skipCols;
+} IncrementalSortPath;
+
+
/*
* GroupPath represents grouping (of presorted input)
*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index d2fff76653..45cfbee724 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern bool enable_indexonlyscan;
extern bool enable_bitmapscan;
extern bool enable_tidscan;
extern bool enable_sort;
+extern bool enable_incrementalsort;
extern bool enable_hashagg;
extern bool enable_nestloop;
extern bool enable_material;
@@ -105,8 +106,9 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
- List *pathkeys, Cost input_cost, double tuples, int width,
- Cost comparison_cost, int sort_mem,
+ List *pathkeys, int presorted_keys,
+ Cost input_startup_cost, Cost input_total_cost,
+ double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_append(AppendPath *path);
extern void cost_merge_append(Path *path, PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 0072b7aa0d..d6b8841d33 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,7 @@ typedef enum
extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern int pathkeys_common(List *keys1, List *keys2);
extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
Relids required_outer,
CostSelector cost_criterion,
@@ -226,6 +227,7 @@ extern List *select_outer_pathkeys_for_merge(PlannerInfo *root,
extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
List *mergeclauses,
List *outer_pathkeys);
+extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
extern List *truncate_useless_pathkeys(PlannerInfo *root,
RelOptInfo *rel,
List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 299c9f846a..43e8ef20dc 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -206,6 +206,9 @@ extern void mergejoinscansel(PlannerInfo *root, Node *clause,
extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows, List **pgset);
+extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ double tuples);
+
extern void estimate_hash_bucket_stats(PlannerInfo *root,
Node *hashkey, double nbuckets,
Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 5d57c503ab..9a5b7f8d3c 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -90,7 +90,8 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
- int workMem, bool randomAccess);
+ int workMem, bool randomAccess,
+ bool skipAbbrev);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
@@ -134,6 +135,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
extern void tuplesort_end(Tuplesortstate *state);
+extern void tuplesort_reset(Tuplesortstate *state);
+
extern void tuplesort_get_stats(Tuplesortstate *state,
TuplesortInstrumentation *stats);
extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
-Sort
+Incremental Sort
Sort Key: id, data
- -> Seq Scan on test_dc
+ Presorted Key: id
+ -> Index Scan using test_dc_pkey on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index a79f891da7..0926650a0f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1517,6 +1517,7 @@ NOTICE: drop cascades to table matest1
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
SELECT thousand, tenthous FROM tenk1
@@ -1657,9 +1658,45 @@ FROM generate_series(1, 3) g(i);
{3,7,8,10,13,13,16,18,19,22}
(3 rows)
+set enable_incrementalsort = on;
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Merge Append
+ Sort Key: tenk1.thousand, tenk1.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1
+ -> Incremental Sort
+ Sort Key: tenk1_1.thousand, tenk1_1.thousand
+ Presorted Key: tenk1_1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+(7 rows)
+
+explain (costs off)
+SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+ QUERY PLAN
+-------------------------------------------------------------
+ Merge Append
+ Sort Key: a.thousand, a.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 a
+ -> Incremental Sort
+ Sort Key: b.unique2, b.unique2
+ Presorted Key: b.unique2
+ -> Index Only Scan using tenk1_unique2 on tenk1 b
+(7 rows)
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index c9c8f51e1c..898361d6b3 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_gathermerge | on
enable_hashagg | on
enable_hashjoin | on
+ enable_incrementalsort | on
enable_indexonlyscan | on
enable_indexscan | on
enable_material | on
@@ -87,7 +88,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(15 rows)
+(16 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 2e42ae115d..7229997144 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -546,6 +546,7 @@ drop table matest0 cascade;
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
@@ -607,9 +608,26 @@ SELECT
ORDER BY f.i LIMIT 10)
FROM generate_series(1, 3) g(i);
+set enable_incrementalsort = on;
+
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+
+explain (costs off)
+SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
Hello Alexander,
On Thu, January 4, 2018 4:36 pm, Alexander Korotkov wrote:
On Fri, Dec 8, 2017 at 4:06 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

Thank you for pointing that. Sure, both cases are better. I've added
second case as well as comments. Patch is attached.
I had a quick look, this isn't a full review, but a few things struck me
on a read through the diff:
There are quite a few places where lines are broken like so:
+ ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+ pwcxt);
Or like this:
+ result = (PlanState *) ExecInitIncrementalSort(
+ (IncrementalSort *) node, estate, eflags);
e.g. a param is on the next line, but aligned to the very same place where
it would be w/o the linebreak. Or is this just some sort of artefact
because I viewed the diff with tabspacing = 8?
I'd fix the grammar here:
+ * Incremental sort is specially optimized kind of multikey sort when
+ * input is already presorted by prefix of required keys list.
Like so:
"Incremental sort is a specially optimized kind of multikey sort used when
the input is already presorted by a prefix of the required keys list."
+ * Consider following example. We have input tuples consisting from
"Consider the following example: We have ..."
+ * In incremental sort case we also have to cost to detect sort groups.
"we also have to cost the detection of sort groups."
"+ * It turns out into extra copy and comparison for each tuple."
"This turns out to be one extra copy and comparison per tuple."
+ "Portions Copyright (c) 1996-2017"
Should probably be 2018 now - time flies fast :)
return_value = _readMaterial();
else if (MATCH("SORT", 4))
return_value = _readSort();
+ else if (MATCH("INCREMENTALSORT", 7))
+ return_value = _readIncrementalSort();
else if (MATCH("GROUP", 5))
return_value = _readGroup();
I think the ", 7" here is left-over from when it was named "INCSORT", and
it should be MATCH("INCREMENTALSORT", 15), shouldn't it?
+ space, fase when it's value for in-memory
typo: "space, false when ..."
+ bool cmp;
+ cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+
+ if (cmp)
In the above, the variable cmp could be optimized away with:
+ if (cmpSortSkipCols(node, node->sampleSlot, slot))
(not sure if modern compilers won't do this, anyway, though)
+typedef struct IncrementalSortState
+{
+ ScanState ss; /* its first field is NodeTag */
+ bool bounded; /* is the result set
bounded? */
+ int64 bound; /* if bounded, how many
tuples are needed */
If I'm not wrong, the layout of the struct will include quite a bit of
padding on 64 bit due to the mixing of bool and int64, maybe it would be
better to sort the fields differently, e.g. pack 4 or 8 bools together?
Not sure if that makes much of a difference, though.
That's all for now :)
Thank you for your work,
Tels
Hi!
On Fri, Jan 5, 2018 at 2:21 AM, Tels <nospam-pg-abuse@bloodgate.com> wrote:
On Thu, January 4, 2018 4:36 pm, Alexander Korotkov wrote:

On Fri, Dec 8, 2017 at 4:06 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

Thank you for pointing that. Sure, both cases are better. I've added
second case as well as comments. Patch is attached.

I had a quick look, this isn't a full review, but a few things struck me
on a read through the diff:

There are quite a few places where lines are broken like so:

+ ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+                                     pwcxt);
It's quite common practice to align second argument to the same position as
first argument. See other lines nearby.
Or like this:

+ result = (PlanState *) ExecInitIncrementalSort(
+     (IncrementalSort *) node, estate, eflags);
It was probably not such a good idea to insert a line break before the
first argument. Fixed.
e.g. a param is on the next line, but aligned to the very same place where
it would be w/o the linebreak. Or is this just some sort of artefact
because I viewed the diff with tabspacing = 8?

I'd fix the grammar here:

+ * Incremental sort is specially optimized kind of multikey sort when
+ * input is already presorted by prefix of required keys list.

Like so:

"Incremental sort is a specially optimized kind of multikey sort used when
the input is already presorted by a prefix of the required keys list."

+ * Consider following example. We have input tuples consisting from

"Consider the following example: We have ..."

+ * In incremental sort case we also have to cost to detect sort groups.

"we also have to cost the detection of sort groups."

"+ * It turns out into extra copy and comparison for each tuple."

"This turns out to be one extra copy and comparison per tuple."
Many thanks for noticing these. Fixed.
+ "Portions Copyright (c) 1996-2017"
Should probably be 2018 now - time flies fast :)
Right. Happy New Year! :)
		return_value = _readMaterial();
	else if (MATCH("SORT", 4))
		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 7))
+		return_value = _readIncrementalSort();
	else if (MATCH("GROUP", 5))
		return_value = _readGroup();
I think the ", 7" here is left-over from when it was named "INCSORT", and
it should be MATCH("INCREMENTALSORT", 15)), shouldn't it?
Good catch, thank you!
+ space, fase when it's value for in-memory
typo: "space, false when ..."
Right. Fixed.
+	bool	cmp;
+
+	cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+
+	if (cmp)
In the above, the variable cmp could be optimized away with:
+ if (cmpSortSkipCols(node, node->sampleSlot, slot))
Right. This comes from a time when there was more complicated code which
had to use the cmp variable multiple times.
(not sure if modern compilers won't do this anyway, though)
Anyway, it's a code simplification, which is good regardless of whether
compilers are able to do it themselves or not.
+typedef struct IncrementalSortState
+{
+	ScanState	ss;			/* its first field is NodeTag */
+	bool		bounded;	/* is the result set bounded? */
+	int64		bound;		/* if bounded, how many tuples are needed */
If I'm not wrong, the layout of the struct will include quite a bit of
padding on 64 bit due to the mixing of bool and int64, maybe it would be
better to sort the fields differently, e.g. pack 4 or 8 bools together?
Not sure if that makes much of a difference, though.
I'd like to keep the common members of SortState and IncrementalSortState
ordered the same way.
Thus, I think that if we're going to reorder, then we should do it in both
data structures.
But I'm not sure it's worth considering, because these data structures are
very unlikely to be the source of significant memory consumption...
That's all for now :)
Great, thank you for the review.
BTW, I also fixed documentation markup (regarding migration to xml).
Rebased patch is attached.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
incremental-sort-14.patch (application/octet-stream)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 683d641fa7..1814f98b8e 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1979,27 +1979,18 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
119
(10 rows)
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, pushed down, thanks to incremental sort on remote side
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
- QUERY PLAN
----------------------------------------------------------------------
+ QUERY PLAN
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit
Output: t1.c1, t2.c1
- -> Sort
+ -> Foreign Scan
Output: t1.c1, t2.c1
- Sort Key: t1.c1, t2.c1
- -> Nested Loop
- Output: t1.c1, t2.c1
- -> Foreign Scan on public.ft1 t1
- Output: t1.c1
- Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
- -> Materialize
- Output: t2.c1
- -> Foreign Scan on public.ft2 t2
- Output: t2.c1
- Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-(15 rows)
+ Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+ Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
c1 | c1
@@ -2016,6 +2007,44 @@ SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 1
1 | 110
(10 rows)
+-- CROSS JOIN, not pushed down, because we don't push down LIMIT and remote side
+-- can't perform top-N sort like local side can.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+ QUERY PLAN
+------------------------------------------------------------------
+ Limit
+ Output: t1.c3, t2.c3
+ -> Sort
+ Output: t1.c3, t2.c3
+ Sort Key: t1.c3, t2.c3
+ -> Nested Loop
+ Output: t1.c3, t2.c3
+ -> Foreign Scan on public.ft1 t1
+ Output: t1.c3
+ Remote SQL: SELECT c3 FROM "S 1"."T 1"
+ -> Materialize
+ Output: t2.c3
+ -> Foreign Scan on public.ft2 t2
+ Output: t2.c3
+ Remote SQL: SELECT c3 FROM "S 1"."T 1"
+(15 rows)
+
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+ c3 | c3
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 3c3c5c705f..bbf697d64b 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -508,10 +508,15 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, pushed down, thanks to incremental sort on remote side
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, not pushed down, because we don't push down LIMIT and remote side
+-- can't perform top-N sort like local side can.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
-- different server, not pushed down. No result expected.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e4a01699e4..fdcdc6683f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3553,6 +3553,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+ <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of incremental sort
+ steps. The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
<term><varname>enable_indexscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 79e6985d0d..6cf5f8bad1 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
ExplainState *es);
static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
- int nkeys, AttrNumber *keycols,
+ int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
@@ -1011,6 +1015,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Sort:
pname = sname = "Sort";
break;
+ case T_IncrementalSort:
+ pname = sname = "Incremental Sort";
+ break;
case T_Group:
pname = sname = "Group";
break;
@@ -1611,6 +1618,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_sort_keys(castNode(SortState, planstate), ancestors, es);
show_sort_info(castNode(SortState, planstate), es);
break;
+ case T_IncrementalSort:
+ show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ ancestors, es);
+ show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ es);
+ break;
case T_MergeAppend:
show_merge_append_keys(castNode(MergeAppendState, planstate),
ancestors, es);
@@ -1936,14 +1949,37 @@ static void
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
+ int skipCols;
+
+ if (IsA(plan, IncrementalSort))
+ skipCols = ((IncrementalSort *) plan)->skipCols;
+ else
+ skipCols = 0;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
- plan->numCols, plan->sortColIdx,
+ plan->numCols, skipCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
+/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es)
+{
+ IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+ show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ plan->sort.numCols, plan->skipCols,
+ plan->sort.sortColIdx,
+ plan->sort.sortOperators, plan->sort.collations,
+ plan->sort.nullsFirst,
+ ancestors, es);
+}
+
/*
* Likewise, for a MergeAppend node.
*/
@@ -1954,7 +1990,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
- plan->numCols, plan->sortColIdx,
+ plan->numCols, 0, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
@@ -1978,7 +2014,7 @@ show_agg_keys(AggState *astate, List *ancestors,
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
- plan->numCols, plan->grpColIdx,
+ plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
@@ -2047,7 +2083,7 @@ show_grouping_set_keys(PlanState *planstate,
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
- sortnode->numCols, sortnode->sortColIdx,
+ sortnode->numCols, 0, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
@@ -2104,7 +2140,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
- plan->numCols, plan->grpColIdx,
+ plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
@@ -2117,13 +2153,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
- int nkeys, AttrNumber *keycols,
+ int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
+ List *resultPresorted = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
@@ -2163,9 +2200,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
nullsFirst[keyno]);
/* Emit one property-list item per sort key */
result = lappend(result, pstrdup(sortkeybuf.data));
+ if (keyno < nPresortedKeys)
+ resultPresorted = lappend(resultPresorted, exprstr);
}
ExplainPropertyList(qlabel, result, es);
+ if (nPresortedKeys > 0)
+ ExplainPropertyList("Presorted Key", resultPresorted, es);
}
/*
@@ -2373,6 +2414,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
}
}
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for a incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es)
+{
+ if (es->analyze && incrsortstate->sort_Done &&
+ incrsortstate->tuplesortstate != NULL)
+ {
+ Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ TuplesortInstrumentation stats;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+
+ tuplesort_get_stats(state, &stats);
+ sortMethod = tuplesort_method_name(stats.sortMethod);
+ spaceType = tuplesort_space_type_name(stats.spaceType);
+ spaceUsed = stats.spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",
+ sortMethod, spaceType, spaceUsed);
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Groups: %ld\n",
+ incrsortstate->groupsCount);
+ }
+ else
+ {
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups: %ld",
+ incrsortstate->groupsCount, es);
+ }
+ }
+
+ if (incrsortstate->shared_info != NULL)
+ {
+ int n;
+ bool opened_group = false;
+
+ for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+ {
+ TuplesortInstrumentation *sinstrument;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+ int64 groupsCount;
+
+ sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+ groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+ if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+ continue; /* ignore any unfilled slots */
+ sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+ spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+ spaceUsed = sinstrument->spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str,
+ "Worker %d: Sort Method: %s %s: %ldkB Groups: %ld\n",
+ n, sortMethod, spaceType, spaceUsed, groupsCount);
+ }
+ else
+ {
+ if (!opened_group)
+ {
+ ExplainOpenGroup("Workers", "Workers", false, es);
+ opened_group = true;
+ }
+ ExplainOpenGroup("Worker", NULL, true, es);
+ ExplainPropertyInteger("Worker Number", n, es);
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups", groupsCount, es);
+ ExplainCloseGroup("Worker", NULL, true, es);
+ }
+ }
+ if (opened_group)
+ ExplainCloseGroup("Workers", "Workers", false, es);
+ }
+}
+
/*
* Show information on hash buckets/batches.
*/
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
- nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
- nodeValuesscan.o \
+ nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+ nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..34e05330ea 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
ExecReScanSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecReScanIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecReScanGroup((GroupState *) node);
break;
@@ -525,8 +530,12 @@ ExecSupportsBackwardScan(Plan *node)
case T_CteScan:
case T_Material:
case T_Sort:
+ /* these don't evaluate tlist */
return true;
+ case T_IncrementalSort:
+ return false;
+
case T_LockRows:
case T_Limit:
return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f8b72ebab9..490d6dd76c 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
#include "executor/nodeForeignscan.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeSeqscan.h"
@@ -280,6 +281,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecSortEstimate((SortState *) planstate, e->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+ break;
default:
break;
@@ -493,6 +498,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+ break;
default:
break;
@@ -918,6 +927,10 @@ ExecParallelReInitializeDSM(PlanState *planstate,
case T_SortState:
/* these nodes have DSM state, but no reinitialization is required */
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortReInitializeDSM((IncrementalSortState *) planstate, pcxt);
+ break;
default:
break;
@@ -976,6 +989,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
case T_SortState:
ExecSortRetrieveInstrumentation((SortState *) planstate);
break;
+ case T_IncrementalSortState:
+ ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+ break;
case T_HashState:
ExecHashRetrieveInstrumentation((HashState *) planstate);
break;
@@ -1225,6 +1241,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecSortInitializeWorker((SortState *) planstate, pwcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+ pwcxt);
+ break;
default:
break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 43a27a9af2..17163448a3 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
estate, eflags);
break;
+ case T_IncrementalSort:
+ result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+ estate, eflags);
+ break;
+
case T_Group:
result = (PlanState *) ExecInitGroup((Group *) node,
estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
ExecEndSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecEndIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecEndGroup((GroupState *) node);
break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 46ee880415..30855c3fe7 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -667,6 +667,7 @@ initialize_phase(AggState *aggstate, int newphase)
sortnode->collations,
sortnode->nullsFirst,
work_mem,
+ false,
false);
}
@@ -754,7 +755,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
- work_mem, false);
+ work_mem, false, false);
}
/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..a8e55e5e2d
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,646 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncremenalSort.c
+ * Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ * Incremental sort is a specially optimized kind of multikey sort used
+ * when the input is already presorted by a prefix of the required keys
+ * list. Thus, when it's required to sort by (key1, key2 ... keyN) and
+ * result is already sorted by (key1, key2 ... keyM), M < N, we sort groups
+ * where values of (key1, key2 ... keyM) are equal.
+ *
+ * Consider the following example. We have input tuples consisting from
+ * two integers (x, y) already presorted by x, while it's required to
+ * sort them by x and y. Let input tuples be following.
+ *
+ * (1, 5)
+ * (1, 2)
+ * (2, 10)
+ * (2, 1)
+ * (2, 5)
+ * (3, 3)
+ * (3, 7)
+ *
+ * Incremental sort algorithm would sort by y following groups, which have
+ * equal x, individually:
+ * (1, 5) (1, 2)
+ * (2, 10) (2, 1) (2, 5)
+ * (3, 3) (3, 7)
+ *
+ * After sorting these groups and putting them altogether, we would get
+ * following tuple set which is actually sorted by x and y.
+ *
+ * (1, 2)
+ * (1, 5)
+ * (2, 1)
+ * (2, 5)
+ * (2, 10)
+ * (3, 3)
+ * (3, 7)
+ *
+ * Incremental sort is faster than full sort on large datasets. But
+ * the case of most huge benefit of incremental sort is queries with
+ * LIMIT because incremental sort can return first tuples without reading
+ * whole input dataset.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/executor/nodeIncremenalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for skipKeys comparison.
+ */
+static void
+prepareSkipCols(IncrementalSortState *node)
+{
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ int skipCols,
+ i;
+
+ Assert(IsA(plannode, IncrementalSort));
+ skipCols = plannode->skipCols;
+
+ node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+ for (i = 0; i < skipCols; i++)
+ {
+ Oid equalityOp, equalityFunc;
+ SkipKeyData *key;
+
+ key = &node->skipKeys[i];
+ key->attno = plannode->sort.sortColIdx[i];
+
+ equalityOp = get_equality_op_for_ordering_op(
+ plannode->sort.sortOperators[i], NULL);
+ if (!OidIsValid(equalityOp))
+ elog(ERROR, "missing equality operator for ordering operator %u",
+ plannode->sort.sortOperators[i]);
+
+ equalityFunc = get_opcode(equalityOp);
+ if (!OidIsValid(equalityFunc))
+ elog(ERROR, "missing function for operator %u", equalityOp);
+
+ /* Lookup the comparison function */
+ fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+ /* We can initialize the callinfo just once and re-use it */
+ InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ plannode->sort.collations[i], NULL, NULL);
+ key->fcinfo.argnull[0] = false;
+ key->fcinfo.argnull[1] = false;
+ }
+}
+
+/*
+ * Check if first "skipCols" sort values are equal.
+ */
+static bool
+cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ TupleTableSlot *b)
+{
+ int n, i;
+
+ Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+ n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+ for (i = 0; i < n; i++)
+ {
+ Datum datumA, datumB, result;
+ bool isnullA, isnullB;
+ AttrNumber attno = node->skipKeys[i].attno;
+ SkipKeyData *key;
+
+ datumA = slot_getattr(a, attno, &isnullA);
+ datumB = slot_getattr(b, attno, &isnullB);
+
+ /* Special case for NULL-vs-NULL, else use standard comparison */
+ if (isnullA || isnullB)
+ {
+ if (isnullA == isnullB)
+ continue;
+ else
+ return false;
+ }
+
+ key = &node->skipKeys[i];
+
+ key->fcinfo.arg[0] = datumA;
+ key->fcinfo.arg[1] = datumB;
+
+ /* just for paranoia's sake, we reset isnull each time */
+ key->fcinfo.isnull = false;
+
+ result = FunctionCallInvoke(&key->fcinfo);
+
+ /* Check for null result, since caller is clearly not expecting one */
+ if (key->fcinfo.isnull)
+ elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+ if (!DatumGetBool(result))
+ return false;
+ }
+ return true;
+}
+
+/*
+ * Copying of tuples to the node->sampleSlot introduces some overhead. It's
+ * especially notable when groups are containing one or few tuples. In order
+ * to cope this problem we don't copy sample tuple before the group contains
+ * at least MIN_GROUP_SIZE of tuples. Surely, it might reduce efficiency of
+ * incremental sort, but it reduces the probability of regression.
+ */
+#define MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSort
+ *
+ * Assuming that outer subtree returns tuple presorted by some prefix
+ * of target sort columns, performs incremental sort. It fetches
+ * groups of tuples where prefix sort columns are equal and sorts them
+ * using tuplesort. This approach allows to evade sorting of whole
+ * dataset. Besides taking less memory and being faster, it allows to
+ * start returning tuples before fetching full dataset from outer
+ * subtree.
+ *
+ * Conditions:
+ * -- none.
+ *
+ * Initial States:
+ * -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+ IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+ EState *estate;
+ ScanDirection dir;
+ Tuplesortstate *tuplesortstate;
+ TupleTableSlot *slot;
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ PlanState *outerNode;
+ TupleDesc tupDesc;
+ int64 nTuples = 0;
+
+ /*
+ * get state info from node
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "entering routine");
+
+ estate = node->ss.ps.state;
+ dir = estate->es_direction;
+ tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+ /*
+ * Return next tuple from sorted set if any.
+ */
+ if (node->sort_Done)
+ {
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ if (tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL) || node->finished)
+ return slot;
+ }
+
+ /*
+ * If first time through, read all tuples from outer plan and pass them to
+ * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
+ */
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "sorting subplan");
+
+ /*
+ * Want to scan subplan in the forward direction while creating the
+ * sorted data.
+ */
+ estate->es_direction = ForwardScanDirection;
+
+ /*
+ * Initialize tuplesort module.
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "calling tuplesort_begin");
+
+ outerNode = outerPlanState(node);
+ tupDesc = ExecGetResultType(outerNode);
+
+ if (node->tuplesortstate == NULL)
+ {
+ /*
+ * We are going to process the first group of presorted data.
+ * Initialize support structures for cmpSortSkipCols - already
+ * sorted columns.
+ */
+ prepareSkipCols(node);
+
+ /*
+ * Pass all the columns to tuplesort. We pass to tuple sort groups
+ * of at least MIN_GROUP_SIZE size. Thus, these groups doesn't
+ * necessary have equal value of the first column. We unlikely will
+ * have huge groups with incremental sort. Therefore usage of
+ * abbreviated keys would be likely a waste of time.
+ */
+ tuplesortstate = tuplesort_begin_heap(
+ tupDesc,
+ plannode->sort.numCols,
+ plannode->sort.sortColIdx,
+ plannode->sort.sortOperators,
+ plannode->sort.collations,
+ plannode->sort.nullsFirst,
+ work_mem,
+ false,
+ true);
+ node->tuplesortstate = (void *) tuplesortstate;
+ }
+ else
+ {
+ /* Next group of presorted data */
+ tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ }
+ node->groupsCount++;
+
+ /* Calculate remaining bound for bounded sort */
+ if (node->bounded)
+ tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+ /* Put saved tuple to tuplesort if any */
+ if (!TupIsNull(node->sampleSlot))
+ {
+ tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+ ExecClearTuple(node->sampleSlot);
+ nTuples++;
+ }
+
+ /*
+ * Put next group of tuples where skipCols sort values are equal to
+ * tuplesort.
+ */
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+
+ /* Put next group of presorted data to the tuplesort */
+ if (nTuples < MIN_GROUP_SIZE)
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+
+ /* Save last tuple in minimal group */
+ if (nTuples == MIN_GROUP_SIZE - 1)
+ ExecCopySlot(node->sampleSlot, slot);
+ nTuples++;
+ }
+ else
+ {
+ /* Iterate while skip cols are the same as in saved tuple */
+ if (cmpSortSkipCols(node, node->sampleSlot, slot))
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ nTuples++;
+ }
+ else
+ {
+ ExecCopySlot(node->sampleSlot, slot);
+ break;
+ }
+ }
+ }
+
+ /*
+ * Complete the sort.
+ */
+ tuplesort_performsort(tuplesortstate);
+
+ /*
+ * restore to user specified direction
+ */
+ estate->es_direction = dir;
+
+ /*
+ * finally set the sorted flag to true
+ */
+ node->sort_Done = true;
+ node->bounded_Done = node->bounded;
+ if (node->shared_info && node->am_worker)
+ {
+ TuplesortInstrumentation *si;
+
+ Assert(IsParallelWorker());
+ Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+ si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+ tuplesort_get_stats(tuplesortstate, si);
+ node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+ node->groupsCount;
+ }
+
+ /*
+ * Adjust bound_Done with number of tuples we've actually sorted.
+ */
+ if (node->bounded)
+ {
+ if (node->finished)
+ node->bound_Done = node->bound;
+ else
+ node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ }
+
+ SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "retrieving tuple from tuplesort");
+
+ /*
+ * Get the first or next tuple from tuplesort. Returns NULL if no more
+ * tuples.
+ */
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+ return slot;
+}
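The group-accumulation loop above is the heart of the executor change. As a quick illustration, here is a standalone sketch of the same batching idea (toy code, not PostgreSQL internals: the Tup struct, the MIN_GROUP_SIZE value and the qsort-based sort are stand-ins for the node's tuplesort machinery):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MIN_GROUP_SIZE 4

/* Toy tuple: "prefix" is the presorted (skip) column, "rest" needs sorting. */
typedef struct { int prefix; int rest; } Tup;

static int cmp_tup(const void *a, const void *b)
{
	const Tup *x = a, *y = b;
	if (x->prefix != y->prefix)
		return x->prefix - y->prefix;
	return x->rest - y->rest;
}

/*
 * Incrementally sort "in" (already ordered by "prefix") into "out":
 * unconditionally accumulate a minimal batch of MIN_GROUP_SIZE tuples,
 * keep consuming while the presorted prefix stays equal to the last
 * accumulated tuple, then sort the batch and emit it.
 */
static void inc_sort(const Tup *in, int n, Tup *out)
{
	int i = 0, outpos = 0;

	while (i < n)
	{
		int start = i;

		/* Take the minimal batch unconditionally. */
		while (i < n && i - start < MIN_GROUP_SIZE)
			i++;
		/* Extend it while the presorted prefix is unchanged. */
		while (i < n && in[i].prefix == in[i - 1].prefix)
			i++;

		memcpy(out + outpos, in + start, (i - start) * sizeof(Tup));
		qsort(out + outpos, i - start, sizeof(Tup), cmp_tup);
		outpos += i - start;
	}
}
```

Since the input is already ordered by the prefix, every batch boundary falls on a prefix change, so concatenating the per-batch sorts yields output fully ordered by (prefix, rest).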
+
+/* ----------------------------------------------------------------
+ * ExecInitIncrementalSort
+ *
+ * Creates the run-time state information for the sort node
+ * produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+ IncrementalSortState *incrsortstate;
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "initializing sort node");
+
+ /*
+ * Incremental sort can't be used with EXEC_FLAG_REWIND,
+ * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only keep the
+ * current group of tuples in the tuplesortstate.
+ */
+ Assert((eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) == 0);
+
+ /*
+ * create state structure
+ */
+ incrsortstate = makeNode(IncrementalSortState);
+ incrsortstate->ss.ps.plan = (Plan *) node;
+ incrsortstate->ss.ps.state = estate;
+ incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+ incrsortstate->bounded = false;
+ incrsortstate->sort_Done = false;
+ incrsortstate->finished = false;
+ incrsortstate->tuplesortstate = NULL;
+ incrsortstate->sampleSlot = NULL;
+ incrsortstate->bound_Done = 0;
+ incrsortstate->groupsCount = 0;
+ incrsortstate->skipKeys = NULL;
+
+ /*
+ * Miscellaneous initialization
+ *
+ * Sort nodes don't initialize their ExprContexts because they never call
+ * ExecQual or ExecProject.
+ */
+
+ /*
+ * tuple table initialization
+ *
+ * sort nodes only return scan tuples from their sorted relation.
+ */
+ ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+
+ /*
+ * initialize child nodes
+ *
+ * We shield the child node from the need to support REWIND, BACKWARD, or
+ * MARK/RESTORE.
+ */
+ eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+ outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+ /*
+ * initialize tuple type. no need to initialize projection info because
+ * this node doesn't do projections.
+ */
+ ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+ /* make standalone slot to store previous tuple from outer node */
+ incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+ ExecGetResultType(outerPlanState(incrsortstate)));
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "sort node initialized");
+
+ return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ * ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "shutting down sort node");
+
+ /*
+ * clean out the tuple table
+ */
+ ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ /* must drop standalone tuple slot from outer node */
+ ExecDropSingleTupleTableSlot(node->sampleSlot);
+
+ /*
+ * Release tuplesort resources
+ */
+ if (node->tuplesortstate != NULL)
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+
+ /*
+ * shut down the subplan
+ */
+ ExecEndNode(outerPlanState(node));
+
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+ PlanState *outerPlan = outerPlanState(node);
+
+ /*
+ * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ * re-scan it at all.
+ */
+ if (!node->sort_Done)
+ return;
+
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+ /*
+ * Incremental sort keeps only the current group of tuples in the
+ * tuplesort, so there is no complete sorted output to rewind and
+ * rescan. We therefore always forget the previous sort results and
+ * re-sort: the subplan must be re-read and the data sorted again
+ * from scratch.
+ */
+ node->sort_Done = false;
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+ node->bound_Done = 0;
+
+ /*
+ * if chgParam of subnode is not null then plan will be re-scanned by
+ * first ExecProcNode.
+ */
+ if (outerPlan->chgParam == NULL)
+ ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ * Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortEstimate
+ *
+ * Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+ size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+ shm_toc_estimate_chunk(&pcxt->estimator, size);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortInitializeDSM
+ *
+ * Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + pcxt->nworkers * sizeof(IncrementalSortInfo);
+ node->shared_info = shm_toc_allocate(pcxt->toc, size);
+ /* ensure any unfilled slots will contain zeroes */
+ memset(node->shared_info, 0, size);
+ node->shared_info->num_workers = pcxt->nworkers;
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+ node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+ /* If there's any instrumentation space, clear it for next time */
+ if (node->shared_info != NULL)
+ {
+ memset(node->shared_info->sinfo, 0,
+ node->shared_info->num_workers * sizeof(IncrementalSortInfo));
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortInitializeWorker
+ *
+ * Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+ node->shared_info =
+ shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+ node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortRetrieveInstrumentation
+ *
+ * Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+ Size size;
+ SharedIncrementalSortInfo *si;
+
+ if (node->shared_info == NULL)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+ si = palloc(size);
+ memcpy(si, node->shared_info, size);
+ node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 9c68de8565..90c82af17f 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
plannode->collations,
plannode->nullsFirst,
work_mem,
- node->randomAccess);
+ node->randomAccess,
+ false);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index ddbbc79823..94d5ba0e41 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -919,6 +919,24 @@ _copyMaterial(const Material *from)
}
+/*
+ * CopySortFields
+ *
+ * This function copies the fields of the Sort node. It is used by
+ * all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+ CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+ COPY_SCALAR_FIELD(numCols);
+ COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
/*
* _copySort
*/
@@ -930,13 +948,29 @@ _copySort(const Sort *from)
/*
* copy node superclass fields
*/
- CopyPlanFields((const Plan *) from, (Plan *) newnode);
+ CopySortFields(from, newnode);
- COPY_SCALAR_FIELD(numCols);
- COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
- COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
- COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
- COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+ IncrementalSort *newnode = makeNode(IncrementalSort);
+
+ /*
+ * copy node superclass fields
+ */
+ CopySortFields((const Sort *) from, (Sort *) newnode);
+
+ /*
+ * copy remainder of node
+ */
+ COPY_SCALAR_FIELD(skipCols);
return newnode;
}
@@ -4817,6 +4851,9 @@ copyObjectImpl(const void *from)
case T_Sort:
retval = _copySort(from);
break;
+ case T_IncrementalSort:
+ retval = _copyIncrementalSort(from);
+ break;
case T_Group:
retval = _copyGroup(from);
break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 5e72df137e..415a9e9b19 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -870,12 +870,10 @@ _outMaterial(StringInfo str, const Material *node)
}
static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
{
int i;
- WRITE_NODE_TYPE("SORT");
-
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
@@ -897,6 +895,24 @@ _outSort(StringInfo str, const Sort *node)
appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
}
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+ WRITE_NODE_TYPE("SORT");
+
+ _outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+ WRITE_NODE_TYPE("INCREMENTALSORT");
+
+ _outSortInfo(str, (const Sort *) node);
+
+ WRITE_INT_FIELD(skipCols);
+}
+
static void
_outUnique(StringInfo str, const Unique *node)
{
@@ -3739,6 +3755,9 @@ outNode(StringInfo str, const void *obj)
case T_Sort:
_outSort(str, obj);
break;
+ case T_IncrementalSort:
+ _outIncrementalSort(str, obj);
+ break;
case T_Unique:
_outUnique(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 9925866b53..9f64d50103 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2060,12 +2060,13 @@ _readMaterial(void)
}
/*
- * _readSort
+ * ReadCommonSort
+ * Read the fields common to all nodes that inherit from Sort
*/
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
{
- READ_LOCALS(Sort);
+ READ_TEMP_LOCALS();
ReadCommonPlan(&local_node->plan);
@@ -2074,6 +2075,32 @@ _readSort(void)
READ_OID_ARRAY(sortOperators, local_node->numCols);
READ_OID_ARRAY(collations, local_node->numCols);
READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+ READ_LOCALS_NO_FIELDS(Sort);
+
+ ReadCommonSort(local_node);
+
+ READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+ READ_LOCALS(IncrementalSort);
+
+ ReadCommonSort(&local_node->sort);
+
+ READ_INT_FIELD(skipCols);
READ_DONE();
}
@@ -2636,6 +2663,8 @@ parseNodeString(void)
return_value = _readMaterial();
else if (MATCH("SORT", 4))
return_value = _readSort();
+ else if (MATCH("INCREMENTALSORT", 15))
+ return_value = _readIncrementalSort();
else if (MATCH("GROUP", 5))
return_value = _readGroup();
else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 12a6ee4a22..e96c5fe137 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3613,6 +3613,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
ptype = "Sort";
subpath = ((SortPath *) path)->subpath;
break;
+ case T_IncrementalSortPath:
+ ptype = "IncrementalSort";
+ subpath = ((SortPath *) path)->subpath;
+ break;
case T_GroupPath:
ptype = "Group";
subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8679b14b29..fd0ba203d5 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -121,6 +121,7 @@ bool enable_indexonlyscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
+bool enable_incrementalsort = true;
bool enable_hashagg = true;
bool enable_nestloop = true;
bool enable_material = true;
@@ -1605,6 +1606,13 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
+ * The sort can be either a full sort of the relation, or an incremental sort
+ * when the input data is already presorted by some of the required pathkeys.
+ * In the latter case we estimate the number of groups the input is divided
+ * into by the presorted pathkeys, and then estimate the cost of sorting each
+ * individual group, assuming the data is divided into groups uniformly.
+ * Also, if a LIMIT is specified, we need to fetch and sort only some of the groups.
+ *
* If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
@@ -1631,7 +1639,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
+ * 'presorted_keys' is the number of pathkeys already presorted in the given path
+ * 'input_startup_cost' is the startup cost for reading the input data
+ * 'input_total_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
@@ -1647,19 +1657,28 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
*/
void
cost_sort(Path *path, PlannerInfo *root,
- List *pathkeys, Cost input_cost, double tuples, int width,
- Cost comparison_cost, int sort_mem,
+ List *pathkeys, int presorted_keys,
+ Cost input_startup_cost, Cost input_total_cost,
+ double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples)
{
- Cost startup_cost = input_cost;
- Cost run_cost = 0;
+ Cost startup_cost = input_startup_cost;
+ Cost run_cost = 0,
+ rest_cost,
+ group_cost,
+ input_run_cost = input_total_cost - input_startup_cost;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
+ double num_groups,
+ group_input_bytes,
+ group_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
+ if (!enable_incrementalsort)
+ presorted_keys = 0;
path->rows = tuples;
@@ -1685,13 +1704,50 @@ cost_sort(Path *path, PlannerInfo *root,
output_bytes = input_bytes;
}
- if (output_bytes > sort_mem_bytes)
+ /*
+ * Estimate the number of groups into which the dataset is divided by the
+ * presorted keys.
+ */
+ if (presorted_keys > 0)
+ {
+ List *presortedExprs = NIL;
+ ListCell *l;
+ int i = 0;
+
+ /* Extract presorted keys as list of expressions */
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ presortedExprs = lappend(presortedExprs, member->em_expr);
+
+ i++;
+ if (i >= presorted_keys)
+ break;
+ }
+
+ /* Estimate number of groups with equal presorted keys */
+ num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
+ }
+ else
+ {
+ num_groups = 1.0;
+ }
+
+ /*
+ * Estimate the average cost of sorting one group of tuples whose
+ * presorted keys are all equal.
+ */
+ group_input_bytes = input_bytes / num_groups;
+ group_tuples = tuples / num_groups;
+ if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
- double npages = ceil(input_bytes / BLCKSZ);
- double nruns = input_bytes / sort_mem_bytes;
+ double npages = ceil(group_input_bytes / BLCKSZ);
+ double nruns = group_input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
@@ -1701,7 +1757,7 @@ cost_sort(Path *path, PlannerInfo *root,
*
* Assume about N log2 N comparisons
*/
- startup_cost += comparison_cost * tuples * LOG2(tuples);
+ group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
/* Disk costs */
@@ -1712,10 +1768,10 @@ cost_sort(Path *path, PlannerInfo *root,
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
- startup_cost += npageaccesses *
+ group_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
- else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
+ else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
@@ -1723,14 +1779,33 @@ cost_sort(Path *path, PlannerInfo *root,
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
- startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+ group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
}
else
{
- /* We'll use plain quicksort on all the input tuples */
- startup_cost += comparison_cost * tuples * LOG2(tuples);
+ /*
+ * We'll use plain quicksort on all the input tuples. If we expect
+ * fewer than two tuples per sort group, assume the logarithmic part
+ * of the estimate to be 1.
+ */
+ if (group_tuples >= 2.0)
+ group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
+ else
+ group_cost = comparison_cost * group_tuples;
}
+ /* Add per group cost of fetching tuples from input */
+ group_cost += input_run_cost / num_groups;
+
+ /*
+ * We have to sort the first group before the node can start returning
+ * tuples; sorting the remaining groups returns all the other tuples.
+ */
+ startup_cost += group_cost;
+ rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ if (rest_cost > 0.0)
+ run_cost += rest_cost;
+
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
@@ -1741,6 +1816,20 @@ cost_sort(Path *path, PlannerInfo *root,
*/
run_cost += cpu_operator_cost * tuples;
+ /* Extra costs of incremental sort */
+ if (presorted_keys > 0)
+ {
+ /*
+ * In the incremental sort case we also have to charge for detecting
+ * the sort groups. This turns out to be one extra copy and one extra
+ * comparison per tuple.
+ */
+ run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+ /* Cost of per group tuplesort reset */
+ run_cost += 2.0 * cpu_tuple_cost * num_groups;
+ }
+
path->startup_cost = startup_cost;
path->total_cost = startup_cost + run_cost;
}
@@ -2717,6 +2806,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
cost_sort(&sort_path,
root,
outersortkeys,
+ pathkeys_common(outer_path->pathkeys, outersortkeys),
+ outer_path->startup_cost,
outer_path->total_cost,
outer_path_rows,
outer_path->pathtarget->width,
@@ -2743,6 +2834,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
cost_sort(&sort_path,
root,
innersortkeys,
+ pathkeys_common(inner_path->pathkeys, innersortkeys),
+ inner_path->startup_cost,
inner_path->total_cost,
inner_path_rows,
inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index ef58cff28d..329ba7b532 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -22,10 +22,12 @@
#include "nodes/nodeFuncs.h"
#include "nodes/plannodes.h"
#include "optimizer/clauses.h"
+#include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/tlist.h"
#include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -308,6 +310,33 @@ compare_pathkeys(List *keys1, List *keys2)
return PATHKEYS_EQUAL;
}
+
+/*
+ * pathkeys_common
+ * Returns the length of the longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+ int n;
+ ListCell *key1,
+ *key2;
+ n = 0;
+
+ forboth(key1, keys1, key2, keys2)
+ {
+ PathKey *pathkey1 = (PathKey *) lfirst(key1);
+ PathKey *pathkey2 = (PathKey *) lfirst(key2);
+
+ if (pathkey1 != pathkey2)
+ return n;
+ n++;
+ }
+
+ return n;
+}
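The contract of pathkeys_common() — the length of the longest common prefix of two pathkey lists — can be illustrated with plain integer arrays standing in for the lists (a toy model; real pathkeys are canonical PathKey nodes compared by pointer equality):

```c
#include <assert.h>

/* Longest common prefix of two arrays, mirroring pathkeys_common(). */
static int common_prefix(const int *a, int alen, const int *b, int blen)
{
	int n = 0;

	while (n < alen && n < blen && a[n] == b[n])
		n++;
	return n;
}
```

This is what makes the patch's planner change possible: instead of the old all-or-nothing test (pathkeys_contained_in), any nonzero common prefix is now useful, because incremental sort can supply the remaining keys.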
+
+
/*
* pathkeys_contained_in
* Common special case of compare_pathkeys: we just want to know
@@ -1488,26 +1517,42 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Returns the number of pathkeys that match the given argument; the rest
+ * can be satisfied by an incremental sort.
*/
-static int
-pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
+int
+pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
{
- if (root->query_pathkeys == NIL)
+ int n_common_pathkeys;
+
+ if (query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
- if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
+ n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
+
+ if (enable_incrementalsort)
{
- /* It's useful ... or at least the first N keys are */
- return list_length(root->query_pathkeys);
+ /*
+ * Return the number of common pathkeys, or 0 if there are none. Any
+ * common prefix of the pathkeys is useful for ordering, because the
+ * remainder can be handled by incremental sort.
+ */
+ return n_common_pathkeys;
+ }
+ else
+ {
+ /*
+ * When incremental sort is disabled, pathkeys are useful only when
+ * they contain all the query pathkeys.
+ */
+ if (n_common_pathkeys == list_length(query_pathkeys))
+ return n_common_pathkeys;
+ else
+ return 0;
}
-
- return 0; /* path ordering not useful */
}
/*
@@ -1523,7 +1568,7 @@ truncate_useless_pathkeys(PlannerInfo *root,
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
- nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
+ nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e599283d6b..133435f516 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -236,7 +236,7 @@ static MergeJoin *make_mergejoin(List *tlist,
Plan *lefttree, Plan *righttree,
JoinType jointype, bool inner_unique,
bool skip_mark_restore);
-static Sort *make_sort(Plan *lefttree, int numCols,
+static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
@@ -252,10 +252,11 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
- Relids relids);
+ Relids relids, int skipCols);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
- Plan *lefttree);
+ Plan *lefttree,
+ int skipCols);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
@@ -437,6 +438,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
(GatherPath *) best_path);
break;
case T_Sort:
+ case T_IncrementalSort:
plan = (Plan *) create_sort_plan(root,
(SortPath *) best_path,
flags);
@@ -1122,6 +1124,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int n_common_pathkeys;
/* Build the child plan */
/* Must insist that all children return the same tlist */
@@ -1156,9 +1159,11 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
- if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+ if (n_common_pathkeys < list_length(pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
+ n_common_pathkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
@@ -1508,6 +1513,7 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
Plan *subplan;
List *pathkeys = best_path->path.pathkeys;
List *tlist = build_path_tlist(root, &best_path->path);
+ int n_common_pathkeys;
/* As with Gather, it's best to project away columns in the workers. */
subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
@@ -1537,12 +1543,16 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
- if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
+ n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
+ if (n_common_pathkeys < list_length(pathkeys))
+ {
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ n_common_pathkeys,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
+ }
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
@@ -1655,6 +1665,7 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
{
Sort *plan;
Plan *subplan;
+ int n_common_pathkeys;
/*
* We don't want any excess columns in the sorted tuples, so request a
@@ -1664,7 +1675,13 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
- plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+ if (IsA(best_path, IncrementalSortPath))
+ n_common_pathkeys = ((IncrementalSortPath *) best_path)->skipCols;
+ else
+ n_common_pathkeys = 0;
+
+ plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+ NULL, n_common_pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -1908,7 +1925,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
- subplan);
+ subplan,
+ 0);
}
if (!rollup->is_hashed)
@@ -3848,10 +3866,15 @@ create_mergejoin_plan(PlannerInfo *root,
*/
if (best_path->outersortkeys)
{
+ Sort *sort;
+ int n_common_pathkeys;
Relids outer_relids = outer_path->parent->relids;
- Sort *sort = make_sort_from_pathkeys(outer_plan,
- best_path->outersortkeys,
- outer_relids);
+
+ n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
+ best_path->jpath.outerjoinpath->pathkeys);
+
+ sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
+ outer_relids, n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
@@ -3862,10 +3885,15 @@ create_mergejoin_plan(PlannerInfo *root,
if (best_path->innersortkeys)
{
+ Sort *sort;
+ int n_common_pathkeys;
Relids inner_relids = inner_path->parent->relids;
- Sort *sort = make_sort_from_pathkeys(inner_plan,
- best_path->innersortkeys,
- inner_relids);
+
+ n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
+ best_path->jpath.innerjoinpath->pathkeys);
+
+ sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
+ inner_relids, n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
@@ -4927,8 +4955,13 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
{
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
+ int skip_cols = 0;
+
+ if (IsA(plan, IncrementalSort))
+ skip_cols = ((IncrementalSort *) plan)->skipCols;
- cost_sort(&sort_path, root, NIL,
+ cost_sort(&sort_path, root, NIL, skip_cols,
+ lefttree->startup_cost,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
@@ -5519,13 +5552,31 @@ make_mergejoin(List *tlist,
* nullsFirst arrays already.
*/
static Sort *
-make_sort(Plan *lefttree, int numCols,
+make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
- Sort *node = makeNode(Sort);
- Plan *plan = &node->plan;
+ Sort *node;
+ Plan *plan;
+
+ /* Always use regular sort node when enable_incrementalsort = false */
+ if (!enable_incrementalsort)
+ skipCols = 0;
+
+ if (skipCols == 0)
+ {
+ node = makeNode(Sort);
+ }
+ else
+ {
+ IncrementalSort *incrementalSort;
+
+ incrementalSort = makeNode(IncrementalSort);
+ node = &incrementalSort->sort;
+ incrementalSort->skipCols = skipCols;
+ }
+ plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
@@ -5858,9 +5909,11 @@ find_ec_member_for_tle(EquivalenceClass *ec,
* 'lefttree' is the node which yields input tuples
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
* 'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ * 'skipCols' is the number of presorted columns in input tuples
*/
static Sort *
-make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
+make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
+ Relids relids, int skipCols)
{
int numsortkeys;
AttrNumber *sortColIdx;
@@ -5880,7 +5933,7 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
&nullsFirst);
/* Now build the Sort node */
- return make_sort(lefttree, numsortkeys,
+ return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
@@ -5923,7 +5976,7 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
numsortkeys++;
}
- return make_sort(lefttree, numsortkeys,
+ return make_sort(lefttree, numsortkeys, 0,
sortColIdx, sortOperators,
collations, nullsFirst);
}
@@ -5944,7 +5997,8 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
- Plan *lefttree)
+ Plan *lefttree,
+ int skipCols)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
@@ -5977,7 +6031,7 @@ make_sort_from_groupcols(List *groupcls,
numsortkeys++;
}
- return make_sort(lefttree, numsortkeys,
+ return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
@@ -6633,6 +6687,7 @@ is_projection_capable_plan(Plan *plan)
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 95cbffbd69..308f60beac 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -44,6 +44,7 @@
#include "parser/parse_clause.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
#include "utils/syscache.h"
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7b52dadd81..3842271245 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3849,14 +3849,14 @@ create_grouping_paths(PlannerInfo *root,
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
- bool is_sorted;
+ int n_useful_pathkeys;
- is_sorted = pathkeys_contained_in(root->group_pathkeys,
- path->pathkeys);
- if (path == cheapest_partial_path || is_sorted)
+ n_useful_pathkeys = pathkeys_useful_for_ordering(
+ root->group_pathkeys, path->pathkeys);
+ if (path == cheapest_partial_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest partial path, if it isn't already */
- if (!is_sorted)
+ if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
@@ -3929,14 +3929,14 @@ create_grouping_paths(PlannerInfo *root,
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
- bool is_sorted;
+ int n_useful_pathkeys;
- is_sorted = pathkeys_contained_in(root->group_pathkeys,
- path->pathkeys);
- if (path == cheapest_path || is_sorted)
+ n_useful_pathkeys = pathkeys_useful_for_ordering(
+ root->group_pathkeys, path->pathkeys);
+ if (path == cheapest_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest-total path if it isn't already sorted */
- if (!is_sorted)
+ if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
@@ -5003,13 +5003,13 @@ create_ordered_paths(PlannerInfo *root,
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
- bool is_sorted;
+ int n_useful_pathkeys;
- is_sorted = pathkeys_contained_in(root->sort_pathkeys,
- path->pathkeys);
- if (path == cheapest_input_path || is_sorted)
+ n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
+ path->pathkeys);
+ if (path == cheapest_input_path || n_useful_pathkeys > 0)
{
- if (!is_sorted)
+ if (n_useful_pathkeys < list_length(root->sort_pathkeys))
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
@@ -6139,8 +6139,9 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
- cost_sort(&seqScanAndSortPath, root, NIL,
- seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
+ cost_sort(&seqScanAndSortPath, root, NIL, 0,
+ seqScanPath->startup_cost, seqScanPath->total_cost,
+ rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 4617d12cb9..be520e6086 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 46367cba63..616ad1a474 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2782,6 +2782,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 5a08e75ad5..eb95ca4c5e 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -983,7 +983,8 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
- cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
+ cost_sort(&sorted_p, root, NIL, 0,
+ sorted_p.startup_cost, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 7df8761710..9c6f910f14 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -105,7 +105,7 @@ compare_path_costs(Path *path1, Path *path2, CostSelector criterion)
}
/*
- * compare_path_fractional_costs
+ * compare_fractional_path_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
@@ -1356,12 +1356,13 @@ create_merge_append_path(PlannerInfo *root,
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ int n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
- if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
@@ -1375,6 +1376,8 @@ create_merge_append_path(PlannerInfo *root,
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->parent->tuples,
subpath->pathtarget->width,
@@ -1622,7 +1625,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
/*
* Estimate cost for sort+unique implementation
*/
- cost_sort(&sort_path, root, NIL,
+ cost_sort(&sort_path, root, NIL, 0,
+ subpath->startup_cost,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
@@ -1715,6 +1719,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
GatherMergePath *pathnode = makeNode(GatherMergePath);
Cost input_startup_cost = 0;
Cost input_total_cost = 0;
+ int n_common_pathkeys;
Assert(subpath->parallel_safe);
Assert(pathkeys);
@@ -1731,7 +1736,9 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
- if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+
+ if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
@@ -1745,6 +1752,8 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
@@ -2604,9 +2613,31 @@ create_sort_path(PlannerInfo *root,
List *pathkeys,
double limit_tuples)
{
- SortPath *pathnode = makeNode(SortPath);
+ SortPath *pathnode;
+ int n_common_pathkeys;
+
+ if (enable_incrementalsort)
+ n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+ else
+ n_common_pathkeys = 0;
+
+ if (n_common_pathkeys == 0)
+ {
+ pathnode = makeNode(SortPath);
+ pathnode->path.pathtype = T_Sort;
+ }
+ else
+ {
+ IncrementalSortPath *incpathnode;
+
+ incpathnode = makeNode(IncrementalSortPath);
+ pathnode = &incpathnode->spath;
+ pathnode->path.pathtype = T_IncrementalSort;
+ incpathnode->skipCols = n_common_pathkeys;
+ }
+
+ Assert(n_common_pathkeys < list_length(pathkeys));
- pathnode->path.pathtype = T_Sort;
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
@@ -2620,7 +2651,9 @@ create_sort_path(PlannerInfo *root,
pathnode->subpath = subpath;
- cost_sort(&pathnode->path, root, pathkeys,
+ cost_sort(&pathnode->path, root,
+ pathkeys, n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
@@ -2932,7 +2965,8 @@ create_groupingsets_path(PlannerInfo *root,
else
{
/* Account for cost of sort, but don't charge input cost again */
- cost_sort(&sort_path, root, NIL,
+ cost_sort(&sort_path, root, NIL, 0,
+ 0.0,
0.0,
subpath->rows,
subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 79dbfd1a05..e3e984b3da 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -291,7 +291,8 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
qstate->sortCollations,
qstate->sortNullsFirsts,
work_mem,
- qstate->rescan_needed);
+ qstate->rescan_needed,
+ false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fcc8323f62..4726bee850 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3714,6 +3714,42 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
return numdistinct;
}
+/*
+ * estimate_pathkeys_groups - Estimate the number of groups the dataset is
+ * divided into by each prefix of the pathkeys.
+ *
+ * Returns an array of group counts: the i'th element (zero-based) is the
+ * number of groups the first i+1 pathkeys divide the dataset into.  This is
+ * a convenience wrapper over estimate_num_groups().
+ */
+double *
+estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+{
+ ListCell *l;
+ List *groupExprs = NIL;
+ double *result;
+ int i;
+
+ /*
+ * Get number of groups for each prefix of pathkeys.
+ */
+ i = 0;
+ result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ groupExprs = lappend(groupExprs, member->em_expr);
+
+ result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ i++;
+ }
+
+ return result;
+}
+
/*
* Estimate hash bucket statistics when the specified expression is used
* as a hash key for the given number of buckets.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 72f6be329e..bea4f00421 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -857,6 +857,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of incremental sort steps."),
+ NULL
+ },
+ &enable_incrementalsort,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index eecc66cafa..0265da312b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -231,6 +231,13 @@ struct Tuplesortstate
int64 allowedMem; /* total memory allowed, in bytes */
int maxTapes; /* number of tapes (Knuth's T) */
int tapeRange; /* maxTapes-1 (Knuth's P) */
+ int64 maxSpace; /* maximum amount of space occupied among sorts
+ of groups, either in-memory or on-disk */
+ bool maxSpaceOnDisk; /* true when maxSpace is the value for on-disk
+ space, false when it is the value for
+ in-memory space */
+ TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ MemoryContext maincontext;
MemoryContext sortcontext; /* memory context holding most sort data */
MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
@@ -573,6 +580,9 @@ static void writetup_datum(Tuplesortstate *state, int tapenum,
static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state, bool delete);
+static void tuplesort_updatemax(Tuplesortstate *state);
+
/*
* Special versions of qsort just for SortTuple objects. qsort_tuple() sorts
@@ -607,18 +617,27 @@ static Tuplesortstate *
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
+ MemoryContext maincontext;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
- * Create a working memory context for this sort operation. All data
- * needed by the sort will live inside this context.
+ * Memory context surviving tuplesort_reset. This memory context holds
+ * data which is useful to keep while sorting multiple similar batches.
*/
- sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+ maincontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
+ /*
+ * Create a working memory context for one sort operation. The content of
+ * this context is deleted by tuplesort_reset.
+ */
+ sortcontext = AllocSetContextCreate(maincontext,
+ "TupleSort sort",
+ ALLOCSET_DEFAULT_SIZES);
+
/*
* Caller tuple (e.g. IndexTuple) memory context.
*
@@ -636,7 +655,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
- oldcontext = MemoryContextSwitchTo(sortcontext);
+ oldcontext = MemoryContextSwitchTo(maincontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
@@ -654,6 +673,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
state->availMem = state->allowedMem;
state->sortcontext = sortcontext;
state->tuplecontext = tuplecontext;
+ state->maincontext = maincontext;
state->tapeset = NULL;
state->memtupcount = 0;
@@ -694,13 +714,14 @@ tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
- int workMem, bool randomAccess)
+ int workMem, bool randomAccess,
+ bool skipAbbrev)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
AssertArg(nkeys > 0);
@@ -742,7 +763,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
- sortKey->abbreviate = (i == 0);
+ sortKey->abbreviate = (i == 0) && !skipAbbrev;
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
@@ -773,7 +794,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -864,7 +885,7 @@ tuplesort_begin_index_btree(Relation heapRel,
MemoryContext oldcontext;
int i;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -939,7 +960,7 @@ tuplesort_begin_index_hash(Relation heapRel,
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -981,7 +1002,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
int16 typlen;
bool typbyval;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -1092,16 +1113,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
}
/*
- * tuplesort_end
- *
- * Release resources and clean up.
+ * tuplesort_free
*
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage. Be careful not to attempt to use or free such
- * pointers afterwards!
+ * Internal routine for freeing a tuplesort's resources.
*/
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state, bool delete)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1160,7 +1177,98 @@ tuplesort_end(Tuplesortstate *state)
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
- MemoryContextDelete(state->sortcontext);
+ if (delete)
+ {
+ MemoryContextDelete(state->maincontext);
+ }
+ else
+ {
+ MemoryContextResetOnly(state->sortcontext);
+ MemoryContextResetOnly(state->tuplecontext);
+ }
+}
+
+/*
+ * tuplesort_end
+ *
+ * Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage. Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+ tuplesort_free(state, true);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ * Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+ int64 spaceUsed;
+ bool spaceUsedOnDisk;
+
+ /*
+ * Note: it might seem we should provide both memory and disk usage for a
+ * disk-based sort. However, the current code doesn't track memory space
+ * accurately once we have begun to return tuples to the caller (since we
+ * don't account for pfree's the caller is expected to do), so we cannot
+ * rely on availMem in a disk sort. This does not seem worth the overhead
+ * to fix. Is it worth creating an API for the memory context code to
+ * tell us how much is actually used in sortcontext?
+ */
+ if (state->tapeset)
+ {
+ spaceUsedOnDisk = true;
+ spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+ }
+ else
+ {
+ spaceUsedOnDisk = false;
+ spaceUsed = state->allowedMem - state->availMem;
+ }
+
+ if (spaceUsed > state->maxSpace)
+ {
+ state->maxSpace = spaceUsed;
+ state->maxSpaceOnDisk = spaceUsedOnDisk;
+ state->maxSpaceStatus = state->status;
+ }
+}
+
+/*
+ * tuplesort_reset
+ *
+ * Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ * meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ * start a new sort.  This avoids recreating the tuplesort (and saves
+ * resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+ tuplesort_updatemax(state);
+ tuplesort_free(state, false);
+ state->status = TSS_INITIAL;
+ state->memtupcount = 0;
+ state->boundUsed = false;
+ state->tapeset = NULL;
+ state->currentRun = 0;
+ state->result_tape = -1;
+ state->bounded = false;
+ state->availMem = state->allowedMem;
+ state->lastReturnedTuple = NULL;
+ state->slabAllocatorUsed = false;
+ state->slabMemoryBegin = NULL;
+ state->slabMemoryEnd = NULL;
+ state->slabFreeHead = NULL;
+ USEMEM(state, GetMemoryChunkSpace(state->memtuples));
}
/*
@@ -2944,18 +3052,15 @@ tuplesort_get_stats(Tuplesortstate *state,
* to fix. Is it worth creating an API for the memory context code to
* tell us how much is actually used in sortcontext?
*/
- if (state->tapeset)
- {
+ tuplesort_updatemax(state);
+
+ if (state->maxSpaceOnDisk)
stats->spaceType = SORT_SPACE_TYPE_DISK;
- stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- }
else
- {
stats->spaceType = SORT_SPACE_TYPE_MEMORY;
- stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
- }
+ stats->spaceUsed = (state->maxSpace + 1023) / 1024;
- switch (state->status)
+ switch (state->maxSpaceStatus)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..a9b562843d
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 2a4f7407a1..4180f57e88 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1754,6 +1754,20 @@ typedef struct MaterialState
Tuplestorestate *tuplestorestate;
} MaterialState;
+
+/* ----------------
+ * When sorting by multiple keys, the input dataset may already be presorted
+ * by some prefix of those keys.  We call these "skip keys".
+ * SkipKeyData represents the information about one such key.
+ * ----------------
+ */
+typedef struct SkipKeyData
+{
+ FmgrInfo flinfo; /* comparison function info */
+ FunctionCallInfoData fcinfo; /* comparison function call info */
+ OffsetNumber attno; /* attribute number in tuple */
+} SkipKeyData;
+
/* ----------------
* Shared memory container for per-worker sort information
* ----------------
@@ -1782,6 +1796,44 @@ typedef struct SortState
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
+/* ----------------
+ * Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+ TuplesortInstrumentation sinstrument;
+ int64 groupsCount;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+ int num_workers;
+ IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ * IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+ ScanState ss; /* its first field is NodeTag */
+ bool bounded; /* is the result set bounded? */
+ int64 bound; /* if bounded, how many tuples are needed */
+ bool sort_Done; /* sort completed yet? */
+ bool finished; /* fetching tuples from outer node
+ is finished? */
+ bool bounded_Done; /* value of bounded we did the sort with */
+ int64 bound_Done; /* value of bound we did the sort with */
+ void *tuplesortstate; /* private state of tuplesort.c */
+ SkipKeyData *skipKeys; /* keys the dataset is presorted by */
+ int64 groupsCount; /* number of groups with equal skip keys */
+ TupleTableSlot *sampleSlot; /* slot for sample tuple of sort group */
+ bool am_worker; /* are we a worker? */
+ SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
/* ---------------------
* GroupState information
* ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 2eb3d6d371..b6a9d6c597 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -73,6 +73,7 @@ typedef enum NodeTag
T_HashJoin,
T_Material,
T_Sort,
+ T_IncrementalSort,
T_Group,
T_Agg,
T_WindowAgg,
@@ -125,6 +126,7 @@ typedef enum NodeTag
T_HashJoinState,
T_MaterialState,
T_SortState,
+ T_IncrementalSortState,
T_GroupState,
T_AggState,
T_WindowAggState,
@@ -240,6 +242,7 @@ typedef enum NodeTag
T_ProjectionPath,
T_ProjectSetPath,
T_SortPath,
+ T_IncrementalSortPath,
T_GroupPath,
T_UpperUniquePath,
T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 74e9fb5f7b..033ec416fe 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -750,6 +750,17 @@ typedef struct Sort
bool *nullsFirst; /* NULLS FIRST/LAST directions */
} Sort;
+
+/* ----------------
+ * incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+ Sort sort;
+ int skipCols; /* number of presorted columns */
+} IncrementalSort;
+
/* ---------------
* group node -
* Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 71689b8ed6..0d072fd7c3 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1513,6 +1513,16 @@ typedef struct SortPath
Path *subpath; /* path representing input source */
} SortPath;
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+ SortPath spath;
+ int skipCols;
+} IncrementalSortPath;
+
+
/*
* GroupPath represents grouping (of presorted input)
*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index d2fff76653..45cfbee724 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern bool enable_indexonlyscan;
extern bool enable_bitmapscan;
extern bool enable_tidscan;
extern bool enable_sort;
+extern bool enable_incrementalsort;
extern bool enable_hashagg;
extern bool enable_nestloop;
extern bool enable_material;
@@ -105,8 +106,9 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
- List *pathkeys, Cost input_cost, double tuples, int width,
- Cost comparison_cost, int sort_mem,
+ List *pathkeys, int presorted_keys,
+ Cost input_startup_cost, Cost input_total_cost,
+ double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_append(AppendPath *path);
extern void cost_merge_append(Path *path, PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 0072b7aa0d..d6b8841d33 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,7 @@ typedef enum
extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern int pathkeys_common(List *keys1, List *keys2);
extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
Relids required_outer,
CostSelector cost_criterion,
@@ -226,6 +227,7 @@ extern List *select_outer_pathkeys_for_merge(PlannerInfo *root,
extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
List *mergeclauses,
List *outer_pathkeys);
+extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
extern List *truncate_useless_pathkeys(PlannerInfo *root,
RelOptInfo *rel,
List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 299c9f846a..43e8ef20dc 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -206,6 +206,9 @@ extern void mergejoinscansel(PlannerInfo *root, Node *clause,
extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows, List **pgset);
+extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ double tuples);
+
extern void estimate_hash_bucket_stats(PlannerInfo *root,
Node *hashkey, double nbuckets,
Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 5d57c503ab..9a5b7f8d3c 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -90,7 +90,8 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
- int workMem, bool randomAccess);
+ int workMem, bool randomAccess,
+ bool skipAbbrev);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
@@ -134,6 +135,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
extern void tuplesort_end(Tuplesortstate *state);
+extern void tuplesort_reset(Tuplesortstate *state);
+
extern void tuplesort_get_stats(Tuplesortstate *state,
TuplesortInstrumentation *stats);
extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
-Sort
+Incremental Sort
Sort Key: id, data
- -> Seq Scan on test_dc
+ Presorted Key: id
+ -> Index Scan using test_dc_pkey on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index a79f891da7..0926650a0f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1517,6 +1517,7 @@ NOTICE: drop cascades to table matest1
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
SELECT thousand, tenthous FROM tenk1
@@ -1657,9 +1658,45 @@ FROM generate_series(1, 3) g(i);
{3,7,8,10,13,13,16,18,19,22}
(3 rows)
+set enable_incrementalsort = on;
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Merge Append
+ Sort Key: tenk1.thousand, tenk1.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1
+ -> Incremental Sort
+ Sort Key: tenk1_1.thousand, tenk1_1.thousand
+ Presorted Key: tenk1_1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+(7 rows)
+
+explain (costs off)
+SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+ QUERY PLAN
+-------------------------------------------------------------
+ Merge Append
+ Sort Key: a.thousand, a.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 a
+ -> Incremental Sort
+ Sort Key: b.unique2, b.unique2
+ Presorted Key: b.unique2
+ -> Index Only Scan using tenk1_unique2 on tenk1 b
+(7 rows)
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index c9c8f51e1c..898361d6b3 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_gathermerge | on
enable_hashagg | on
enable_hashjoin | on
+ enable_incrementalsort | on
enable_indexonlyscan | on
enable_indexscan | on
enable_material | on
@@ -87,7 +88,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(15 rows)
+(16 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 2e42ae115d..7229997144 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -546,6 +546,7 @@ drop table matest0 cascade;
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
@@ -607,9 +608,26 @@ SELECT
ORDER BY f.i LIMIT 10)
FROM generate_series(1, 3) g(i);
+set enable_incrementalsort = on;
+
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+
+explain (costs off)
+SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Antonin Houska <ah@cybertec.at> wrote:
Shouldn't the test contain *both* cases?
Thank you for pointing that out. Sure, both cases are better. I've added the second case as well as comments. Patch is attached.
I'm fine with the tests now but have a minor comment on this comment:
-- CROSS JOIN, not pushed down, because we don't push down LIMIT and remote side
-- can't perform top-N sort like local side can.
I think the note on LIMIT push-down makes the comment less clear because
there's no difference in processing the LIMIT: EXPLAIN shows that both
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET
100 LIMIT 10;
and
SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET
100 LIMIT 10;
evaluate the LIMIT clause only locally.
What I consider the important difference is that the 2nd case does not
generate the appropriate input for remote incremental sort (while incremental
sort tends to be very cheap). Therefore it's cheaper to do no remote sort at
all and perform the top-N sort locally than to do a regular (non-incremental)
remote sort.
I have no other questions about this patch. I expect the CFM to set the status
to "ready for committer" as soon as the other reviewers confirm they're happy
about the patch status.
--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
On Mon, Jan 8, 2018 at 2:29 PM, Antonin Houska <ah@cybertec.at> wrote:
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Antonin Houska <ah@cybertec.at> wrote:
Shouldn't the test contain *both* cases?
Thank you for pointing that out. Sure, both cases are better. I've added the
second case as well as comments. Patch is attached.
I'm fine with the tests now but have a minor comment on this comment:
-- CROSS JOIN, not pushed down, because we don't push down LIMIT and remote side
-- can't perform top-N sort like local side can.
I think the note on LIMIT push-down makes the comment less clear because
there's no difference in processing the LIMIT: EXPLAIN shows that both
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
and
SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
evaluate the LIMIT clause only locally.
What I consider the important difference is that the 2nd case does not
generate the appropriate input for remote incremental sort (while incremental
sort tends to be very cheap). Therefore it's cheaper to do no remote sort at
all and perform the top-N sort locally than to do a regular (non-incremental)
remote sort.
Agreed, these comments are not clear enough. I've rewritten them: they became
much wordier, but now they look clearer to me. I've also swapped the order of
the queries, which seems easier to understand.
I have no other questions about this patch. I expect the CFM to set the status
to "ready for committer" as soon as the other reviewers confirm they're happy
about the patch status.
Good, thank you. Let's see what other reviewers will say.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
incremental-sort-15.patch (application/octet-stream)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 683d641fa7..80239faf21 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1979,28 +1979,62 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
119
(10 rows)
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down. For this query, the essential optimization is
+-- a top-N sort, but it can't be performed on the remote side because we never
+-- push down LIMIT. Assuming the sort isn't worth pushing down either, the
+-- CROSS JOIN is also not pushed down, to transfer fewer tuples over the network.
EXPLAIN (VERBOSE, COSTS OFF)
-SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
- QUERY PLAN
----------------------------------------------------------------------
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+ QUERY PLAN
+------------------------------------------------------------------
Limit
- Output: t1.c1, t2.c1
+ Output: t1.c3, t2.c3
-> Sort
- Output: t1.c1, t2.c1
- Sort Key: t1.c1, t2.c1
+ Output: t1.c3, t2.c3
+ Sort Key: t1.c3, t2.c3
-> Nested Loop
- Output: t1.c1, t2.c1
+ Output: t1.c3, t2.c3
-> Foreign Scan on public.ft1 t1
- Output: t1.c1
- Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+ Output: t1.c3
+ Remote SQL: SELECT c3 FROM "S 1"."T 1"
-> Materialize
- Output: t2.c1
+ Output: t2.c3
-> Foreign Scan on public.ft2 t2
- Output: t2.c1
- Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+ Output: t2.c3
+ Remote SQL: SELECT c3 FROM "S 1"."T 1"
(15 rows)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+ c3 | c3
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
+-- CROSS JOIN, pushed down. Unlike the previous query, the remote side can
+-- return tuples in the requested order without a full sort, using an index
+-- scan and an incremental sort. This is much cheaper than a full sort on the
+-- local side, even though the LIMIT is not known on the remote side.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+ QUERY PLAN
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+ Output: t1.c1, t2.c1
+ -> Foreign Scan
+ Output: t1.c1, t2.c1
+ Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+ Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
+
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
c1 | c1
----+-----
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 3c3c5c705f..c324394942 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -508,7 +508,17 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down. For this query, the essential optimization is
+-- a top-N sort, but it can't be performed on the remote side because we never
+-- push down LIMIT. Assuming the sort isn't worth pushing down either, the
+-- CROSS JOIN is also not pushed down, to transfer fewer tuples over the network.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, pushed down. Unlike the previous query, the remote side can
+-- return tuples in the requested order without a full sort, using an index
+-- scan and an incremental sort. This is much cheaper than a full sort on the
+-- local side, even though the LIMIT is not known on the remote side.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e4a01699e4..fdcdc6683f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3553,6 +3553,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+ <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of incremental sort
+ steps. The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
<term><varname>enable_indexscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 79e6985d0d..6cf5f8bad1 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
ExplainState *es);
static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
- int nkeys, AttrNumber *keycols,
+ int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
@@ -1011,6 +1015,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Sort:
pname = sname = "Sort";
break;
+ case T_IncrementalSort:
+ pname = sname = "Incremental Sort";
+ break;
case T_Group:
pname = sname = "Group";
break;
@@ -1611,6 +1618,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_sort_keys(castNode(SortState, planstate), ancestors, es);
show_sort_info(castNode(SortState, planstate), es);
break;
+ case T_IncrementalSort:
+ show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ ancestors, es);
+ show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ es);
+ break;
case T_MergeAppend:
show_merge_append_keys(castNode(MergeAppendState, planstate),
ancestors, es);
@@ -1936,14 +1949,37 @@ static void
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
+ int skipCols;
+
+ if (IsA(plan, IncrementalSort))
+ skipCols = ((IncrementalSort *) plan)->skipCols;
+ else
+ skipCols = 0;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
- plan->numCols, plan->sortColIdx,
+ plan->numCols, skipCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es)
+{
+ IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+ show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ plan->sort.numCols, plan->skipCols,
+ plan->sort.sortColIdx,
+ plan->sort.sortOperators, plan->sort.collations,
+ plan->sort.nullsFirst,
+ ancestors, es);
+}
+
/*
* Likewise, for a MergeAppend node.
*/
@@ -1954,7 +1990,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
- plan->numCols, plan->sortColIdx,
+ plan->numCols, 0, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
@@ -1978,7 +2014,7 @@ show_agg_keys(AggState *astate, List *ancestors,
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
- plan->numCols, plan->grpColIdx,
+ plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
@@ -2047,7 +2083,7 @@ show_grouping_set_keys(PlanState *planstate,
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
- sortnode->numCols, sortnode->sortColIdx,
+ sortnode->numCols, 0, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
@@ -2104,7 +2140,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
- plan->numCols, plan->grpColIdx,
+ plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
@@ -2117,13 +2153,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
- int nkeys, AttrNumber *keycols,
+ int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
+ List *resultPresorted = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
@@ -2163,9 +2200,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
nullsFirst[keyno]);
/* Emit one property-list item per sort key */
result = lappend(result, pstrdup(sortkeybuf.data));
+ if (keyno < nPresortedKeys)
+ resultPresorted = lappend(resultPresorted, exprstr);
}
ExplainPropertyList(qlabel, result, es);
+ if (nPresortedKeys > 0)
+ ExplainPropertyList("Presorted Key", resultPresorted, es);
}
/*
@@ -2373,6 +2414,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
}
}
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es)
+{
+ if (es->analyze && incrsortstate->sort_Done &&
+ incrsortstate->tuplesortstate != NULL)
+ {
+ Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ TuplesortInstrumentation stats;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+
+ tuplesort_get_stats(state, &stats);
+ sortMethod = tuplesort_method_name(stats.sortMethod);
+ spaceType = tuplesort_space_type_name(stats.spaceType);
+ spaceUsed = stats.spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",
+ sortMethod, spaceType, spaceUsed);
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Groups: %ld\n",
+ incrsortstate->groupsCount);
+ }
+ else
+ {
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups",
+ incrsortstate->groupsCount, es);
+ }
+ }
+
+ if (incrsortstate->shared_info != NULL)
+ {
+ int n;
+ bool opened_group = false;
+
+ for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+ {
+ TuplesortInstrumentation *sinstrument;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+ int64 groupsCount;
+
+ sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+ groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+ if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+ continue; /* ignore any unfilled slots */
+ sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+ spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+ spaceUsed = sinstrument->spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str,
+ "Worker %d: Sort Method: %s %s: %ldkB Groups: %ld\n",
+ n, sortMethod, spaceType, spaceUsed, groupsCount);
+ }
+ else
+ {
+ if (!opened_group)
+ {
+ ExplainOpenGroup("Workers", "Workers", false, es);
+ opened_group = true;
+ }
+ ExplainOpenGroup("Worker", NULL, true, es);
+ ExplainPropertyInteger("Worker Number", n, es);
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups", groupsCount, es);
+ ExplainCloseGroup("Worker", NULL, true, es);
+ }
+ }
+ if (opened_group)
+ ExplainCloseGroup("Workers", "Workers", false, es);
+ }
+}
+
/*
* Show information on hash buckets/batches.
*/
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
- nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
- nodeValuesscan.o \
+ nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+ nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..34e05330ea 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
ExecReScanSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecReScanIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecReScanGroup((GroupState *) node);
break;
@@ -525,8 +530,12 @@ ExecSupportsBackwardScan(Plan *node)
case T_CteScan:
case T_Material:
case T_Sort:
+ /* these don't evaluate tlist */
return true;
+ case T_IncrementalSort:
+ return false;
+
case T_LockRows:
case T_Limit:
return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f8b72ebab9..490d6dd76c 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
#include "executor/nodeForeignscan.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeSeqscan.h"
@@ -280,6 +281,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecSortEstimate((SortState *) planstate, e->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+ break;
default:
break;
@@ -493,6 +498,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+ break;
default:
break;
@@ -918,6 +927,10 @@ ExecParallelReInitializeDSM(PlanState *planstate,
case T_SortState:
/* these nodes have DSM state, but no reinitialization is required */
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortReInitializeDSM((IncrementalSortState *) planstate, pcxt);
+ break;
default:
break;
@@ -976,6 +989,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
case T_SortState:
ExecSortRetrieveInstrumentation((SortState *) planstate);
break;
+ case T_IncrementalSortState:
+ ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+ break;
case T_HashState:
ExecHashRetrieveInstrumentation((HashState *) planstate);
break;
@@ -1225,6 +1241,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecSortInitializeWorker((SortState *) planstate, pwcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+ pwcxt);
+ break;
default:
break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 43a27a9af2..17163448a3 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
estate, eflags);
break;
+ case T_IncrementalSort:
+ result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+ estate, eflags);
+ break;
+
case T_Group:
result = (PlanState *) ExecInitGroup((Group *) node,
estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
ExecEndSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecEndIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecEndGroup((GroupState *) node);
break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 46ee880415..30855c3fe7 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -667,6 +667,7 @@ initialize_phase(AggState *aggstate, int newphase)
sortnode->collations,
sortnode->nullsFirst,
work_mem,
+ false,
false);
}
@@ -754,7 +755,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
- work_mem, false);
+ work_mem, false, false);
}
/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..a8e55e5e2d
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,646 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ * Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ * Incremental sort is a specially optimized kind of multikey sort used
+ * when the input is already presorted by a prefix of the required keys
+ * list. Thus, when it's required to sort by (key1, key2 ... keyN) and
+ * result is already sorted by (key1, key2 ... keyM), M < N, we sort groups
+ * where values of (key1, key2 ... keyM) are equal.
+ *
+ * Consider the following example. We have input tuples consisting of
+ * two integers (x, y), already presorted by x, while we need to sort
+ * them by both x and y. Let the input tuples be as follows.
+ *
+ * (1, 5)
+ * (1, 2)
+ * (2, 10)
+ * (2, 1)
+ * (2, 5)
+ * (3, 3)
+ * (3, 7)
+ *
+ * The incremental sort algorithm would individually sort by y the
+ * following groups, which have equal x:
+ * (1, 5) (1, 2)
+ * (2, 10) (2, 1) (2, 5)
+ * (3, 3) (3, 7)
+ *
+ * After sorting these groups and concatenating them, we get the
+ * following tuple set, which is sorted by both x and y.
+ *
+ * (1, 2)
+ * (1, 5)
+ * (2, 1)
+ * (2, 5)
+ * (2, 10)
+ * (3, 3)
+ * (3, 7)
+ *
+ * Incremental sort is faster than a full sort on large datasets. But
+ * the biggest benefit of incremental sort is in queries with LIMIT,
+ * because incremental sort can return the first tuples without reading
+ * the whole input dataset.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for skipKeys comparison.
+ */
+static void
+prepareSkipCols(IncrementalSortState *node)
+{
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ int skipCols,
+ i;
+
+ Assert(IsA(plannode, IncrementalSort));
+ skipCols = plannode->skipCols;
+
+ node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+ for (i = 0; i < skipCols; i++)
+ {
+ Oid equalityOp, equalityFunc;
+ SkipKeyData *key;
+
+ key = &node->skipKeys[i];
+ key->attno = plannode->sort.sortColIdx[i];
+
+ equalityOp = get_equality_op_for_ordering_op(
+ plannode->sort.sortOperators[i], NULL);
+ if (!OidIsValid(equalityOp))
+ elog(ERROR, "missing equality operator for ordering operator %u",
+ plannode->sort.sortOperators[i]);
+
+ equalityFunc = get_opcode(equalityOp);
+ if (!OidIsValid(equalityFunc))
+ elog(ERROR, "missing function for operator %u", equalityOp);
+
+ /* Lookup the comparison function */
+ fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+ /* We can initialize the callinfo just once and re-use it */
+ InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ plannode->sort.collations[i], NULL, NULL);
+ key->fcinfo.argnull[0] = false;
+ key->fcinfo.argnull[1] = false;
+ }
+}
+
+/*
+ * Check if first "skipCols" sort values are equal.
+ */
+static bool
+cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ TupleTableSlot *b)
+{
+ int n, i;
+
+ Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+ n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+ for (i = 0; i < n; i++)
+ {
+ Datum datumA, datumB, result;
+ bool isnullA, isnullB;
+ AttrNumber attno = node->skipKeys[i].attno;
+ SkipKeyData *key;
+
+ datumA = slot_getattr(a, attno, &isnullA);
+ datumB = slot_getattr(b, attno, &isnullB);
+
+ /* Special case for NULL-vs-NULL, else use standard comparison */
+ if (isnullA || isnullB)
+ {
+ if (isnullA == isnullB)
+ continue;
+ else
+ return false;
+ }
+
+ key = &node->skipKeys[i];
+
+ key->fcinfo.arg[0] = datumA;
+ key->fcinfo.arg[1] = datumB;
+
+ /* just for paranoia's sake, we reset isnull each time */
+ key->fcinfo.isnull = false;
+
+ result = FunctionCallInvoke(&key->fcinfo);
+
+ /* Check for null result, since caller is clearly not expecting one */
+ if (key->fcinfo.isnull)
+ elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+ if (!DatumGetBool(result))
+ return false;
+ }
+ return true;
+}
+
+/*
+ * Copying tuples to node->sampleSlot introduces some overhead. It's
+ * especially noticeable when groups contain only one or a few tuples.
+ * To cope with this problem, we don't copy the sample tuple until the
+ * group contains at least MIN_GROUP_SIZE tuples. This might reduce the
+ * efficiency of incremental sort, but it reduces the probability of
+ * regression.
+#define MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSort
+ *
+ * Assuming that the outer subtree returns tuples presorted by some
+ * prefix of the target sort columns, performs an incremental sort. It
+ * fetches groups of tuples where the prefix sort columns are equal and
+ * sorts them using tuplesort. This approach avoids sorting the whole
+ * dataset. Besides taking less memory and being faster, it allows
+ * returning tuples before the full dataset is fetched from the outer
+ * subtree.
+ *
+ * Conditions:
+ * -- none.
+ *
+ * Initial States:
+ * -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+ IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+ EState *estate;
+ ScanDirection dir;
+ Tuplesortstate *tuplesortstate;
+ TupleTableSlot *slot;
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ PlanState *outerNode;
+ TupleDesc tupDesc;
+ int64 nTuples = 0;
+
+ /*
+ * get state info from node
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "entering routine");
+
+ estate = node->ss.ps.state;
+ dir = estate->es_direction;
+ tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+ /*
+ * Return next tuple from sorted set if any.
+ */
+ if (node->sort_Done)
+ {
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ if (tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL) || node->finished)
+ return slot;
+ }
+
+ /*
+ * If first time through, read all tuples from outer plan and pass them to
+ * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
+ */
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "sorting subplan");
+
+ /*
+ * Want to scan subplan in the forward direction while creating the
+ * sorted data.
+ */
+ estate->es_direction = ForwardScanDirection;
+
+ /*
+ * Initialize tuplesort module.
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "calling tuplesort_begin");
+
+ outerNode = outerPlanState(node);
+ tupDesc = ExecGetResultType(outerNode);
+
+ if (node->tuplesortstate == NULL)
+ {
+ /*
+ * We are going to process the first group of presorted data.
+ * Initialize support structures for cmpSortSkipCols - already
+ * sorted columns.
+ */
+ prepareSkipCols(node);
+
+ /*
+ * Pass all the columns to tuplesort.  We pass groups of at least
+ * MIN_GROUP_SIZE tuples to tuplesort, so these groups don't
+ * necessarily have equal values in the first column.  We are unlikely
+ * to have huge groups with incremental sort, so using abbreviated
+ * keys would likely be a waste of time.
+ */
+ tuplesortstate = tuplesort_begin_heap(
+ tupDesc,
+ plannode->sort.numCols,
+ plannode->sort.sortColIdx,
+ plannode->sort.sortOperators,
+ plannode->sort.collations,
+ plannode->sort.nullsFirst,
+ work_mem,
+ false,
+ true);
+ node->tuplesortstate = (void *) tuplesortstate;
+ }
+ else
+ {
+ /* Next group of presorted data */
+ tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ }
+ node->groupsCount++;
+
+ /* Calculate remaining bound for bounded sort */
+ if (node->bounded)
+ tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+ /* Put saved tuple to tuplesort if any */
+ if (!TupIsNull(node->sampleSlot))
+ {
+ tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+ ExecClearTuple(node->sampleSlot);
+ nTuples++;
+ }
+
+ /*
+ * Put the next group of tuples, where the skipCols sort values are
+ * equal, into the tuplesort.
+ */
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+
+ /* Put next group of presorted data to the tuplesort */
+ if (nTuples < MIN_GROUP_SIZE)
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+
+ /* Save last tuple in minimal group */
+ if (nTuples == MIN_GROUP_SIZE - 1)
+ ExecCopySlot(node->sampleSlot, slot);
+ nTuples++;
+ }
+ else
+ {
+ /* Iterate while skip cols are the same as in saved tuple */
+ if (cmpSortSkipCols(node, node->sampleSlot, slot))
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ nTuples++;
+ }
+ else
+ {
+ ExecCopySlot(node->sampleSlot, slot);
+ break;
+ }
+ }
+ }
+
+ /*
+ * Complete the sort.
+ */
+ tuplesort_performsort(tuplesortstate);
+
+ /*
+ * restore to user specified direction
+ */
+ estate->es_direction = dir;
+
+ /*
+ * finally set the sorted flag to true
+ */
+ node->sort_Done = true;
+ node->bounded_Done = node->bounded;
+ if (node->shared_info && node->am_worker)
+ {
+ TuplesortInstrumentation *si;
+
+ Assert(IsParallelWorker());
+ Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+ si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+ tuplesort_get_stats(tuplesortstate, si);
+ node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+ node->groupsCount;
+ }
+
+ /*
+ * Adjust bound_Done with number of tuples we've actually sorted.
+ */
+ if (node->bounded)
+ {
+ if (node->finished)
+ node->bound_Done = node->bound;
+ else
+ node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ }
+
+ SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "retrieving tuple from tuplesort");
+
+ /*
+ * Get the first or next tuple from tuplesort. Returns NULL if no more
+ * tuples.
+ */
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+ return slot;
+}
+
+/* ----------------------------------------------------------------
+ * ExecInitIncrementalSort
+ *
+ * Creates the run-time state information for the sort node
+ * produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+ IncrementalSortState *incrsortstate;
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "initializing sort node");
+
+ /*
+ * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+ * current group in the tuplesortstate.
+ */
+ Assert((eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) == 0);
+
+ /*
+ * create state structure
+ */
+ incrsortstate = makeNode(IncrementalSortState);
+ incrsortstate->ss.ps.plan = (Plan *) node;
+ incrsortstate->ss.ps.state = estate;
+ incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+ incrsortstate->bounded = false;
+ incrsortstate->sort_Done = false;
+ incrsortstate->finished = false;
+ incrsortstate->tuplesortstate = NULL;
+ incrsortstate->sampleSlot = NULL;
+ incrsortstate->bound_Done = 0;
+ incrsortstate->groupsCount = 0;
+ incrsortstate->skipKeys = NULL;
+
+ /*
+ * Miscellaneous initialization
+ *
+ * Sort nodes don't initialize their ExprContexts because they never call
+ * ExecQual or ExecProject.
+ */
+
+ /*
+ * tuple table initialization
+ *
+ * sort nodes only return scan tuples from their sorted relation.
+ */
+ ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+
+ /*
+ * initialize child nodes
+ *
+ * We shield the child node from the need to support REWIND, BACKWARD, or
+ * MARK/RESTORE.
+ */
+ eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+ outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+ /*
+ * initialize tuple type. no need to initialize projection info because
+ * this node doesn't do projections.
+ */
+ ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+ /* make standalone slot to store previous tuple from outer node */
+ incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+ ExecGetResultType(outerPlanState(incrsortstate)));
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "sort node initialized");
+
+ return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ * ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "shutting down sort node");
+
+ /*
+ * clean out the tuple table
+ */
+ ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ /* must drop the standalone slot that stores tuples from the outer node */
+ ExecDropSingleTupleTableSlot(node->sampleSlot);
+
+ /*
+ * Release tuplesort resources
+ */
+ if (node->tuplesortstate != NULL)
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+
+ /*
+ * shut down the subplan
+ */
+ ExecEndNode(outerPlanState(node));
+
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+ PlanState *outerPlan = outerPlanState(node);
+
+ /*
+ * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ * re-scan it at all.
+ */
+ if (!node->sort_Done)
+ return;
+
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+ /*
+ * If subnode is to be rescanned then we forget previous sort results; we
+ * have to re-read the subplan and re-sort. Also must re-sort if the
+ * bounded-sort parameters changed or we didn't select randomAccess.
+ *
+ * Otherwise we can just rewind and rescan the sorted output.
+ */
+ node->sort_Done = false;
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+ node->bound_Done = 0;
+
+ /*
+ * if chgParam of subnode is not null then plan will be re-scanned by
+ * first ExecProcNode.
+ */
+ if (outerPlan->chgParam == NULL)
+ ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ * Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortEstimate
+ *
+ * Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+ size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+ shm_toc_estimate_chunk(&pcxt->estimator, size);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortInitializeDSM
+ *
+ * Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + pcxt->nworkers * sizeof(IncrementalSortInfo);
+ node->shared_info = shm_toc_allocate(pcxt->toc, size);
+ /* ensure any unfilled slots will contain zeroes */
+ memset(node->shared_info, 0, size);
+ node->shared_info->num_workers = pcxt->nworkers;
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+ node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+ /* If there's any instrumentation space, clear it for next time */
+ if (node->shared_info != NULL)
+ {
+ memset(node->shared_info->sinfo, 0,
+ node->shared_info->num_workers * sizeof(IncrementalSortInfo));
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortInitializeWorker
+ *
+ * Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+ node->shared_info =
+ shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+ node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortRetrieveInstrumentation
+ *
+ * Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+ Size size;
+ SharedIncrementalSortInfo *si;
+
+ if (node->shared_info == NULL)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+ si = palloc(size);
+ memcpy(si, node->shared_info, size);
+ node->shared_info = si;
+}
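As an aside for reviewers: the batching strategy in ExecIncrementalSort above (feed at least MIN_GROUP_SIZE tuples to the sort unconditionally, then keep adding tuples while the presorted prefix matches the saved sample) can be sketched as a standalone program. This is an illustrative model only, not patch code: `Tuple`, `incremental_sort` and `cmp_full` are invented names, a plain array stands in for the executor's slots and tuplesort, and MIN_GROUP_SIZE is shrunk so the batching is visible on a tiny input.

```c
#include <stdlib.h>

#define MIN_GROUP_SIZE 4        /* 32 in the patch; smaller for the demo */

typedef struct { int prefix; int value; } Tuple;

/* Compare on the full (prefix, value) sort key. */
static int
cmp_full(const void *a, const void *b)
{
    const Tuple *x = a, *y = b;
    if (x->prefix != y->prefix)
        return x->prefix - y->prefix;
    return x->value - y->value;
}

/* Sort `in` (already presorted by prefix) into `out`, batch by batch. */
static void
incremental_sort(const Tuple *in, int n, Tuple *out)
{
    int pos = 0, outpos = 0;
    Tuple batch[64];            /* stands in for the tuplesort */

    while (pos < n)
    {
        int nTuples = 0;
        Tuple sample;

        /* Unconditionally take the first MIN_GROUP_SIZE tuples ... */
        while (pos < n && nTuples < MIN_GROUP_SIZE)
            batch[nTuples++] = in[pos++];
        sample = batch[nTuples - 1];

        /* ... then extend the batch while the prefix still matches. */
        while (pos < n && in[pos].prefix == sample.prefix)
            batch[nTuples++] = in[pos++];

        /* Each batch ends on a prefix boundary, so sorting the batch on
         * the full key and concatenating yields globally sorted output. */
        qsort(batch, nTuples, sizeof(Tuple), cmp_full);
        for (int i = 0; i < nTuples; i++)
            out[outpos++] = batch[i];
    }
}
```

Because every batch closes on a prefix-group boundary, only one batch is ever held in memory, which is exactly why the executor node cannot support REWIND/BACKWARD/MARK.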
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 9c68de8565..90c82af17f 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
plannode->collations,
plannode->nullsFirst,
work_mem,
- node->randomAccess);
+ node->randomAccess,
+ false);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index ddbbc79823..94d5ba0e41 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -919,6 +919,24 @@ _copyMaterial(const Material *from)
}
+/*
+ * CopySortFields
+ *
+ * This function copies the fields of the Sort node. It is used by
+ * all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+ CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+ COPY_SCALAR_FIELD(numCols);
+ COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
/*
* _copySort
*/
@@ -930,13 +948,29 @@ _copySort(const Sort *from)
/*
* copy node superclass fields
*/
- CopyPlanFields((const Plan *) from, (Plan *) newnode);
+ CopySortFields(from, newnode);
- COPY_SCALAR_FIELD(numCols);
- COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
- COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
- COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
- COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+ IncrementalSort *newnode = makeNode(IncrementalSort);
+
+ /*
+ * copy node superclass fields
+ */
+ CopySortFields((const Sort *) from, (Sort *) newnode);
+
+ /*
+ * copy remainder of node
+ */
+ COPY_SCALAR_FIELD(skipCols);
return newnode;
}
@@ -4817,6 +4851,9 @@ copyObjectImpl(const void *from)
case T_Sort:
retval = _copySort(from);
break;
+ case T_IncrementalSort:
+ retval = _copyIncrementalSort(from);
+ break;
case T_Group:
retval = _copyGroup(from);
break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 5e72df137e..415a9e9b19 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -870,12 +870,10 @@ _outMaterial(StringInfo str, const Material *node)
}
static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
{
int i;
- WRITE_NODE_TYPE("SORT");
-
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
@@ -897,6 +895,24 @@ _outSort(StringInfo str, const Sort *node)
appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
}
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+ WRITE_NODE_TYPE("SORT");
+
+ _outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+ WRITE_NODE_TYPE("INCREMENTALSORT");
+
+ _outSortInfo(str, (const Sort *) node);
+
+ WRITE_INT_FIELD(skipCols);
+}
+
static void
_outUnique(StringInfo str, const Unique *node)
{
@@ -3739,6 +3755,9 @@ outNode(StringInfo str, const void *obj)
case T_Sort:
_outSort(str, obj);
break;
+ case T_IncrementalSort:
+ _outIncrementalSort(str, obj);
+ break;
case T_Unique:
_outUnique(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 9925866b53..9f64d50103 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2060,12 +2060,13 @@ _readMaterial(void)
}
/*
- * _readSort
+ * ReadCommonSort
+ * Assign the basic stuff of all nodes that inherit from Sort
*/
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
{
- READ_LOCALS(Sort);
+ READ_TEMP_LOCALS();
ReadCommonPlan(&local_node->plan);
@@ -2074,6 +2075,32 @@ _readSort(void)
READ_OID_ARRAY(sortOperators, local_node->numCols);
READ_OID_ARRAY(collations, local_node->numCols);
READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+ READ_LOCALS_NO_FIELDS(Sort);
+
+ ReadCommonSort(local_node);
+
+ READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+ READ_LOCALS(IncrementalSort);
+
+ ReadCommonSort(&local_node->sort);
+
+ READ_INT_FIELD(skipCols);
READ_DONE();
}
@@ -2636,6 +2663,8 @@ parseNodeString(void)
return_value = _readMaterial();
else if (MATCH("SORT", 4))
return_value = _readSort();
+ else if (MATCH("INCREMENTALSORT", 15))
+ return_value = _readIncrementalSort();
else if (MATCH("GROUP", 5))
return_value = _readGroup();
else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 12a6ee4a22..e96c5fe137 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3613,6 +3613,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
ptype = "Sort";
subpath = ((SortPath *) path)->subpath;
break;
+ case T_IncrementalSortPath:
+ ptype = "IncrementalSort";
+ subpath = ((SortPath *) path)->subpath;
+ break;
case T_GroupPath:
ptype = "Group";
subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8679b14b29..fd0ba203d5 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -121,6 +121,7 @@ bool enable_indexonlyscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
+bool enable_incrementalsort = true;
bool enable_hashagg = true;
bool enable_nestloop = true;
bool enable_material = true;
@@ -1605,6 +1606,13 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
+ * The sort can be either a full sort of the relation or an incremental sort
+ * when we already have data presorted by some of the required pathkeys.  In
+ * the latter case we estimate the number of groups the input data is divided
+ * into by the presorted pathkeys, and then estimate the cost of sorting each
+ * individual group, assuming the data is divided among groups uniformly.
+ * Also, if a LIMIT is specified, we only have to pull from the input and
+ * sort some of the groups.
+ *
* If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
@@ -1631,7 +1639,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
+ * 'presorted_keys' is the number of pathkeys already presorted in the given path
+ * 'input_startup_cost' is the startup cost for reading the input data
+ * 'input_total_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
@@ -1647,19 +1657,28 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
*/
void
cost_sort(Path *path, PlannerInfo *root,
- List *pathkeys, Cost input_cost, double tuples, int width,
- Cost comparison_cost, int sort_mem,
+ List *pathkeys, int presorted_keys,
+ Cost input_startup_cost, Cost input_total_cost,
+ double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples)
{
- Cost startup_cost = input_cost;
- Cost run_cost = 0;
+ Cost startup_cost = input_startup_cost;
+ Cost run_cost = 0,
+ rest_cost,
+ group_cost,
+ input_run_cost = input_total_cost - input_startup_cost;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
+ double num_groups,
+ group_input_bytes,
+ group_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
+ if (!enable_incrementalsort)
+ presorted_keys = 0;
path->rows = tuples;
@@ -1685,13 +1704,50 @@ cost_sort(Path *path, PlannerInfo *root,
output_bytes = input_bytes;
}
- if (output_bytes > sort_mem_bytes)
+ /*
+ * Estimate the number of groups the dataset is divided into by the presorted keys.
+ */
+ if (presorted_keys > 0)
+ {
+ List *presortedExprs = NIL;
+ ListCell *l;
+ int i = 0;
+
+ /* Extract presorted keys as list of expressions */
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ presortedExprs = lappend(presortedExprs, member->em_expr);
+
+ i++;
+ if (i >= presorted_keys)
+ break;
+ }
+
+ /* Estimate number of groups with equal presorted keys */
+ num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
+ }
+ else
+ {
+ num_groups = 1.0;
+ }
+
+ /*
+ * Estimate the average cost of sorting one group of tuples with equal
+ * presorted keys.
+ */
+ group_input_bytes = input_bytes / num_groups;
+ group_tuples = tuples / num_groups;
+ if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
- double npages = ceil(input_bytes / BLCKSZ);
- double nruns = input_bytes / sort_mem_bytes;
+ double npages = ceil(group_input_bytes / BLCKSZ);
+ double nruns = group_input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
@@ -1701,7 +1757,7 @@ cost_sort(Path *path, PlannerInfo *root,
*
* Assume about N log2 N comparisons
*/
- startup_cost += comparison_cost * tuples * LOG2(tuples);
+ group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
/* Disk costs */
@@ -1712,10 +1768,10 @@ cost_sort(Path *path, PlannerInfo *root,
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
- startup_cost += npageaccesses *
+ group_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
- else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
+ else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
@@ -1723,14 +1779,33 @@ cost_sort(Path *path, PlannerInfo *root,
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
- startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+ group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
}
else
{
- /* We'll use plain quicksort on all the input tuples */
- startup_cost += comparison_cost * tuples * LOG2(tuples);
+ /*
+ * We'll use plain quicksort on all the input tuples.  If we expect
+ * fewer than two tuples per sort group, assume the logarithmic part
+ * of the estimate is 1.
+ */
+ if (group_tuples >= 2.0)
+ group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
+ else
+ group_cost = comparison_cost * group_tuples;
}
+ /* Add the per-group cost of fetching tuples from the input */
+ group_cost += input_run_cost / num_groups;
+
+ /*
+ * We have to sort the first group before the node can start returning
+ * tuples.  Sorting the remaining groups is required to return all the
+ * other tuples.
+ */
+ startup_cost += group_cost;
+ rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ if (rest_cost > 0.0)
+ run_cost += rest_cost;
+
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
@@ -1741,6 +1816,20 @@ cost_sort(Path *path, PlannerInfo *root,
*/
run_cost += cpu_operator_cost * tuples;
+ /* Extra costs of incremental sort */
+ if (presorted_keys > 0)
+ {
+ /*
+ * In the incremental sort case we also have to cost the detection of
+ * sort group boundaries.  This turns out to be one extra copy and
+ * comparison per tuple.
+ */
+ run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+ /* Cost of the per-group tuplesort reset */
+ run_cost += 2.0 * cpu_tuple_cost * num_groups;
+ }
+
path->startup_cost = startup_cost;
path->total_cost = startup_cost + run_cost;
}
@@ -2717,6 +2806,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
cost_sort(&sort_path,
root,
outersortkeys,
+ pathkeys_common(outer_path->pathkeys, outersortkeys),
+ outer_path->startup_cost,
outer_path->total_cost,
outer_path_rows,
outer_path->pathtarget->width,
@@ -2743,6 +2834,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
cost_sort(&sort_path,
root,
innersortkeys,
+ pathkeys_common(inner_path->pathkeys, innersortkeys),
+ inner_path->startup_cost,
inner_path->total_cost,
inner_path_rows,
inner_path->pathtarget->width,
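A note on the cost model above: the key idea is that sorting num_groups groups of tuples/num_groups tuples each is cheaper than one full sort, and only the first group's sort is startup cost. The toy model below shows just that arithmetic; `incremental_sort_cost` and `log2_pow2` are invented names, comparison_cost is folded into the constant, and the patch's real cost_sort additionally handles disk sorts, bounded sorts, LIMIT and the per-tuple group-detection charges.

```c
/* log2 for exact powers of two, to keep the demo libm-free. */
static double
log2_pow2(double x)
{
    double r = 0.0;
    while (x > 1.0)
    {
        x /= 2.0;
        r += 1.0;
    }
    return r;
}

typedef struct { double startup; double total; } SortCost;

/* Comparison-count model: each of num_groups groups of tuples/num_groups
 * tuples costs g * log2(g); the first group's sort is the startup cost. */
static SortCost
incremental_sort_cost(double tuples, double num_groups)
{
    double group_tuples = tuples / num_groups;
    double group_cost = group_tuples * log2_pow2(group_tuples);
    SortCost c;

    c.startup = group_cost;
    c.total = num_groups * group_cost;
    return c;
}
```

For 1024 tuples in 16 groups this gives 16 * 64 * log2(64) = 6144 comparisons versus 1024 * log2(1024) = 10240 for a full sort, with only 384 of them paid before the first tuple can be returned.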
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index ef58cff28d..329ba7b532 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -22,10 +22,12 @@
#include "nodes/nodeFuncs.h"
#include "nodes/plannodes.h"
#include "optimizer/clauses.h"
+#include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/tlist.h"
#include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -308,6 +310,33 @@ compare_pathkeys(List *keys1, List *keys2)
return PATHKEYS_EQUAL;
}
+
+/*
+ * pathkeys_common
+ * Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+ int n;
+ ListCell *key1,
+ *key2;
+ n = 0;
+
+ forboth(key1, keys1, key2, keys2)
+ {
+ PathKey *pathkey1 = (PathKey *) lfirst(key1);
+ PathKey *pathkey2 = (PathKey *) lfirst(key2);
+
+ if (pathkey1 != pathkey2)
+ return n;
+ n++;
+ }
+
+ return n;
+}
+
+
/*
* pathkeys_contained_in
* Common special case of compare_pathkeys: we just want to know
@@ -1488,26 +1517,42 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Returns the number of pathkeys that match the given argument.  The
+ * remaining pathkeys can be satisfied by incremental sort.
*/
-static int
-pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
+int
+pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
{
- if (root->query_pathkeys == NIL)
+ int n_common_pathkeys;
+
+ if (query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
- if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
+ n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
+
+ if (enable_incrementalsort)
{
- /* It's useful ... or at least the first N keys are */
- return list_length(root->query_pathkeys);
+ /*
+ * Return the number of pathkeys in common, or 0 if there are none.
+ * Any leading common pathkeys are useful for ordering, because we
+ * can use incremental sort for the remaining ones.
+ */
+ return n_common_pathkeys;
+ }
+ else
+ {
+ /*
+ * When incremental sort is disabled, pathkeys are useful only when
+ * they contain all the query pathkeys.
+ */
+ if (n_common_pathkeys == list_length(query_pathkeys))
+ return n_common_pathkeys;
+ else
+ return 0;
}
-
- return 0; /* path ordering not useful */
}
/*
@@ -1523,7 +1568,7 @@ truncate_useless_pathkeys(PlannerInfo *root,
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
- nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
+ nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
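For reviewers unfamiliar with the PathKey machinery: pathkeys_common above can compare PathKeys by pointer because the planner canonicalizes them, so equal sort orderings share the same PathKey object. A minimal standalone analog (plain arrays instead of List; `common_prefix` is an invented name, not a planner function):

```c
/* Length of the longest common prefix of two pointer arrays, compared by
 * pointer identity -- the same trick pathkeys_common uses on canonical
 * PathKeys. */
static int
common_prefix(void **keys1, int n1, void **keys2, int n2)
{
    int n = 0;

    while (n < n1 && n < n2 && keys1[n] == keys2[n])
        n++;
    return n;
}
```

The return value plays the role of presorted_keys: a sort over keys2 on input ordered by keys1 can skip the first common_prefix() columns.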
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e599283d6b..133435f516 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -236,7 +236,7 @@ static MergeJoin *make_mergejoin(List *tlist,
Plan *lefttree, Plan *righttree,
JoinType jointype, bool inner_unique,
bool skip_mark_restore);
-static Sort *make_sort(Plan *lefttree, int numCols,
+static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
@@ -252,10 +252,11 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
- Relids relids);
+ Relids relids, int skipCols);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
- Plan *lefttree);
+ Plan *lefttree,
+ int skipCols);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
@@ -437,6 +438,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
(GatherPath *) best_path);
break;
case T_Sort:
+ case T_IncrementalSort:
plan = (Plan *) create_sort_plan(root,
(SortPath *) best_path,
flags);
@@ -1122,6 +1124,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int n_common_pathkeys;
/* Build the child plan */
/* Must insist that all children return the same tlist */
@@ -1156,9 +1159,11 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
- if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+ if (n_common_pathkeys < list_length(pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
+ n_common_pathkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
@@ -1508,6 +1513,7 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
Plan *subplan;
List *pathkeys = best_path->path.pathkeys;
List *tlist = build_path_tlist(root, &best_path->path);
+ int n_common_pathkeys;
/* As with Gather, it's best to project away columns in the workers. */
subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
@@ -1537,12 +1543,16 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
- if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
+ n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
+ if (n_common_pathkeys < list_length(pathkeys))
+ {
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ n_common_pathkeys,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
+ }
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
@@ -1655,6 +1665,7 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
{
Sort *plan;
Plan *subplan;
+ int n_common_pathkeys;
/*
* We don't want any excess columns in the sorted tuples, so request a
@@ -1664,7 +1675,13 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
- plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+ if (IsA(best_path, IncrementalSortPath))
+ n_common_pathkeys = ((IncrementalSortPath *) best_path)->skipCols;
+ else
+ n_common_pathkeys = 0;
+
+ plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+ NULL, n_common_pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -1908,7 +1925,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
- subplan);
+ subplan,
+ 0);
}
if (!rollup->is_hashed)
@@ -3848,10 +3866,15 @@ create_mergejoin_plan(PlannerInfo *root,
*/
if (best_path->outersortkeys)
{
+ Sort *sort;
+ int n_common_pathkeys;
Relids outer_relids = outer_path->parent->relids;
- Sort *sort = make_sort_from_pathkeys(outer_plan,
- best_path->outersortkeys,
- outer_relids);
+
+ n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
+ best_path->jpath.outerjoinpath->pathkeys);
+
+ sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
+ outer_relids, n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
@@ -3862,10 +3885,15 @@ create_mergejoin_plan(PlannerInfo *root,
if (best_path->innersortkeys)
{
+ Sort *sort;
+ int n_common_pathkeys;
Relids inner_relids = inner_path->parent->relids;
- Sort *sort = make_sort_from_pathkeys(inner_plan,
- best_path->innersortkeys,
- inner_relids);
+
+ n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
+ best_path->jpath.innerjoinpath->pathkeys);
+
+ sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
+ inner_relids, n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
@@ -4927,8 +4955,13 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
{
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
+ int skip_cols = 0;
+
+ if (IsA(plan, IncrementalSort))
+ skip_cols = ((IncrementalSort *) plan)->skipCols;
- cost_sort(&sort_path, root, NIL,
+ cost_sort(&sort_path, root, NIL, skip_cols,
+ lefttree->startup_cost,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
@@ -5519,13 +5552,31 @@ make_mergejoin(List *tlist,
* nullsFirst arrays already.
*/
static Sort *
-make_sort(Plan *lefttree, int numCols,
+make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
- Sort *node = makeNode(Sort);
- Plan *plan = &node->plan;
+ Sort *node;
+ Plan *plan;
+
+ /* Always use regular sort node when enable_incrementalsort = false */
+ if (!enable_incrementalsort)
+ skipCols = 0;
+
+ if (skipCols == 0)
+ {
+ node = makeNode(Sort);
+ }
+ else
+ {
+ IncrementalSort *incrementalSort;
+
+ incrementalSort = makeNode(IncrementalSort);
+ node = &incrementalSort->sort;
+ incrementalSort->skipCols = skipCols;
+ }
+ plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
@@ -5858,9 +5909,11 @@ find_ec_member_for_tle(EquivalenceClass *ec,
* 'lefttree' is the node which yields input tuples
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
* 'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ * 'skipCols' is the number of presorted columns in input tuples
*/
static Sort *
-make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
+make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
+ Relids relids, int skipCols)
{
int numsortkeys;
AttrNumber *sortColIdx;
@@ -5880,7 +5933,7 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
&nullsFirst);
/* Now build the Sort node */
- return make_sort(lefttree, numsortkeys,
+ return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
@@ -5923,7 +5976,7 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
numsortkeys++;
}
- return make_sort(lefttree, numsortkeys,
+ return make_sort(lefttree, numsortkeys, 0,
sortColIdx, sortOperators,
collations, nullsFirst);
}
@@ -5944,7 +5997,8 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
- Plan *lefttree)
+ Plan *lefttree,
+ int skipCols)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
@@ -5977,7 +6031,7 @@ make_sort_from_groupcols(List *groupcls,
numsortkeys++;
}
- return make_sort(lefttree, numsortkeys,
+ return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
@@ -6633,6 +6687,7 @@ is_projection_capable_plan(Plan *plan)
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 95cbffbd69..308f60beac 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -44,6 +44,7 @@
#include "parser/parse_clause.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
#include "utils/syscache.h"
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7b52dadd81..3842271245 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3849,14 +3849,14 @@ create_grouping_paths(PlannerInfo *root,
foreach(lc, input_rel->partial_pathlist)
{
Path *path = (Path *) lfirst(lc);
- bool is_sorted;
+ int n_useful_pathkeys;
- is_sorted = pathkeys_contained_in(root->group_pathkeys,
- path->pathkeys);
- if (path == cheapest_partial_path || is_sorted)
+ n_useful_pathkeys = pathkeys_useful_for_ordering(
+ root->group_pathkeys, path->pathkeys);
+ if (path == cheapest_partial_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest partial path, if it isn't already */
- if (!is_sorted)
+ if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
@@ -3929,14 +3929,14 @@ create_grouping_paths(PlannerInfo *root,
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
- bool is_sorted;
+ int n_useful_pathkeys;
- is_sorted = pathkeys_contained_in(root->group_pathkeys,
- path->pathkeys);
- if (path == cheapest_path || is_sorted)
+ n_useful_pathkeys = pathkeys_useful_for_ordering(
+ root->group_pathkeys, path->pathkeys);
+ if (path == cheapest_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest-total path if it isn't already sorted */
- if (!is_sorted)
+ if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
@@ -5003,13 +5003,13 @@ create_ordered_paths(PlannerInfo *root,
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
- bool is_sorted;
+ int n_useful_pathkeys;
- is_sorted = pathkeys_contained_in(root->sort_pathkeys,
- path->pathkeys);
- if (path == cheapest_input_path || is_sorted)
+ n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
+ path->pathkeys);
+ if (path == cheapest_input_path || n_useful_pathkeys > 0)
{
- if (!is_sorted)
+ if (n_useful_pathkeys < list_length(root->sort_pathkeys))
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
@@ -6139,8 +6139,9 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
- cost_sort(&seqScanAndSortPath, root, NIL,
- seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
+ cost_sort(&seqScanAndSortPath, root, NIL, 0,
+ seqScanPath->startup_cost, seqScanPath->total_cost,
+ rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 4617d12cb9..be520e6086 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 46367cba63..616ad1a474 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2782,6 +2782,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 5a08e75ad5..eb95ca4c5e 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -983,7 +983,8 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
- cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
+ cost_sort(&sorted_p, root, NIL, 0,
+ sorted_p.startup_cost, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 7df8761710..9c6f910f14 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -105,7 +105,7 @@ compare_path_costs(Path *path1, Path *path2, CostSelector criterion)
}
/*
- * compare_path_fractional_costs
+ * compare_fractional_path_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
@@ -1356,12 +1356,13 @@ create_merge_append_path(PlannerInfo *root,
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ int n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
- if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
@@ -1375,6 +1376,8 @@ create_merge_append_path(PlannerInfo *root,
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->parent->tuples,
subpath->pathtarget->width,
@@ -1622,7 +1625,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
/*
* Estimate cost for sort+unique implementation
*/
- cost_sort(&sort_path, root, NIL,
+ cost_sort(&sort_path, root, NIL, 0,
+ subpath->startup_cost,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
@@ -1715,6 +1719,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
GatherMergePath *pathnode = makeNode(GatherMergePath);
Cost input_startup_cost = 0;
Cost input_total_cost = 0;
+ int n_common_pathkeys;
Assert(subpath->parallel_safe);
Assert(pathkeys);
@@ -1731,7 +1736,9 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
- if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+
+ if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
@@ -1745,6 +1752,8 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
@@ -2604,9 +2613,31 @@ create_sort_path(PlannerInfo *root,
List *pathkeys,
double limit_tuples)
{
- SortPath *pathnode = makeNode(SortPath);
+ SortPath *pathnode;
+ int n_common_pathkeys;
+
+ if (enable_incrementalsort)
+ n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+ else
+ n_common_pathkeys = 0;
+
+ if (n_common_pathkeys == 0)
+ {
+ pathnode = makeNode(SortPath);
+ pathnode->path.pathtype = T_Sort;
+ }
+ else
+ {
+ IncrementalSortPath *incpathnode;
+
+ incpathnode = makeNode(IncrementalSortPath);
+ pathnode = &incpathnode->spath;
+ pathnode->path.pathtype = T_IncrementalSort;
+ incpathnode->skipCols = n_common_pathkeys;
+ }
+
+ Assert(n_common_pathkeys < list_length(pathkeys));
- pathnode->path.pathtype = T_Sort;
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
@@ -2620,7 +2651,9 @@ create_sort_path(PlannerInfo *root,
pathnode->subpath = subpath;
- cost_sort(&pathnode->path, root, pathkeys,
+ cost_sort(&pathnode->path, root,
+ pathkeys, n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
@@ -2932,7 +2965,8 @@ create_groupingsets_path(PlannerInfo *root,
else
{
/* Account for cost of sort, but don't charge input cost again */
- cost_sort(&sort_path, root, NIL,
+ cost_sort(&sort_path, root, NIL, 0,
+ 0.0,
0.0,
subpath->rows,
subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 79dbfd1a05..e3e984b3da 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -291,7 +291,8 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
qstate->sortCollations,
qstate->sortNullsFirsts,
work_mem,
- qstate->rescan_needed);
+ qstate->rescan_needed,
+ false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fcc8323f62..4726bee850 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3714,6 +3714,42 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
return numdistinct;
}
+/*
+ * estimate_pathkeys_groups - Estimate the number of groups the dataset is
+ * divided into by pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * the first i pathkeys divide the dataset into. This is a convenience
+ * wrapper over estimate_num_groups().
+ */
+double *
+estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+{
+ ListCell *l;
+ List *groupExprs = NIL;
+ double *result;
+ int i;
+
+ /*
+ * Get number of groups for each prefix of pathkeys.
+ */
+ i = 0;
+ result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ groupExprs = lappend(groupExprs, member->em_expr);
+
+ result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ i++;
+ }
+
+ return result;
+}
+
/*
* Estimate hash bucket statistics when the specified expression is used
* as a hash key for the given number of buckets.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 72f6be329e..bea4f00421 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -857,6 +857,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of incremental sort steps."),
+ NULL
+ },
+ &enable_incrementalsort,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index eecc66cafa..0265da312b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -231,6 +231,13 @@ struct Tuplesortstate
int64 allowedMem; /* total memory allowed, in bytes */
int maxTapes; /* number of tapes (Knuth's T) */
int tapeRange; /* maxTapes-1 (Knuth's P) */
+ int64 maxSpace; /* maximum amount of space occupied across sorts
+ of groups, either in-memory or on-disk */
+ bool maxSpaceOnDisk; /* true when maxSpace refers to on-disk space,
+ false when it refers to in-memory space */
+ TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ MemoryContext maincontext;
MemoryContext sortcontext; /* memory context holding most sort data */
MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
@@ -573,6 +580,9 @@ static void writetup_datum(Tuplesortstate *state, int tapenum,
static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state, bool delete);
+static void tuplesort_updatemax(Tuplesortstate *state);
+
/*
* Special versions of qsort just for SortTuple objects. qsort_tuple() sorts
@@ -607,18 +617,27 @@ static Tuplesortstate *
tuplesort_begin_common(int workMem, bool randomAccess)
{
Tuplesortstate *state;
+ MemoryContext maincontext;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
/*
- * Create a working memory context for this sort operation. All data
- * needed by the sort will live inside this context.
+ * Memory context surviving tuplesort_reset. This memory context holds
+ * data which is useful to keep while sorting multiple similar batches.
*/
- sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+ maincontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
+ /*
+ * Create a working memory context for one sort operation. The content of
+ * this context is deleted by tuplesort_reset.
+ */
+ sortcontext = AllocSetContextCreate(maincontext,
+ "TupleSort sort",
+ ALLOCSET_DEFAULT_SIZES);
+
/*
* Caller tuple (e.g. IndexTuple) memory context.
*
@@ -636,7 +655,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
- oldcontext = MemoryContextSwitchTo(sortcontext);
+ oldcontext = MemoryContextSwitchTo(maincontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
@@ -654,6 +673,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
state->availMem = state->allowedMem;
state->sortcontext = sortcontext;
state->tuplecontext = tuplecontext;
+ state->maincontext = maincontext;
state->tapeset = NULL;
state->memtupcount = 0;
@@ -694,13 +714,14 @@ tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
- int workMem, bool randomAccess)
+ int workMem, bool randomAccess,
+ bool skipAbbrev)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
int i;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
AssertArg(nkeys > 0);
@@ -742,7 +763,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
- sortKey->abbreviate = (i == 0);
+ sortKey->abbreviate = (i == 0) && !skipAbbrev;
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
@@ -773,7 +794,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -864,7 +885,7 @@ tuplesort_begin_index_btree(Relation heapRel,
MemoryContext oldcontext;
int i;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -939,7 +960,7 @@ tuplesort_begin_index_hash(Relation heapRel,
Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
MemoryContext oldcontext;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -981,7 +1002,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
int16 typlen;
bool typbyval;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -1092,16 +1113,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
}
/*
- * tuplesort_end
- *
- * Release resources and clean up.
+ * tuplesort_free
*
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage. Be careful not to attempt to use or free such
- * pointers afterwards!
+ * Internal routine for freeing resources of tuplesort.
*/
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state, bool delete)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1160,7 +1177,98 @@ tuplesort_end(Tuplesortstate *state)
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
- MemoryContextDelete(state->sortcontext);
+ if (delete)
+ {
+ MemoryContextDelete(state->maincontext);
+ }
+ else
+ {
+ MemoryContextResetOnly(state->sortcontext);
+ MemoryContextResetOnly(state->tuplecontext);
+ }
+}
+
+/*
+ * tuplesort_end
+ *
+ * Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage. Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+ tuplesort_free(state, true);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ * Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+ int64 spaceUsed;
+ bool spaceUsedOnDisk;
+
+ /*
+ * Note: it might seem we should provide both memory and disk usage for a
+ * disk-based sort. However, the current code doesn't track memory space
+ * accurately once we have begun to return tuples to the caller (since we
+ * don't account for pfree's the caller is expected to do), so we cannot
+ * rely on availMem in a disk sort. This does not seem worth the overhead
+ * to fix. Is it worth creating an API for the memory context code to
+ * tell us how much is actually used in sortcontext?
+ */
+ if (state->tapeset)
+ {
+ spaceUsedOnDisk = true;
+ spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+ }
+ else
+ {
+ spaceUsedOnDisk = false;
+ spaceUsed = state->allowedMem - state->availMem;
+ }
+
+ if (spaceUsed > state->maxSpace)
+ {
+ state->maxSpace = spaceUsed;
+ state->maxSpaceOnDisk = spaceUsedOnDisk;
+ state->maxSpaceStatus = state->status;
+ }
+}
+
+/*
+ * tuplesort_reset
+ *
+ * Reset the tuplesort. Reset all the data in the tuplesort, but keep the
+ * meta-information. After tuplesort_reset, the tuplesort is ready to start
+ * a new sort. This avoids recreating the tuplesort (and saves resources)
+ * when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+ tuplesort_updatemax(state);
+ tuplesort_free(state, false);
+ state->status = TSS_INITIAL;
+ state->memtupcount = 0;
+ state->boundUsed = false;
+ state->tapeset = NULL;
+ state->currentRun = 0;
+ state->result_tape = -1;
+ state->bounded = false;
+ state->availMem = state->allowedMem;
+ state->lastReturnedTuple = NULL;
+ state->slabAllocatorUsed = false;
+ state->slabMemoryBegin = NULL;
+ state->slabMemoryEnd = NULL;
+ state->slabFreeHead = NULL;
+ USEMEM(state, GetMemoryChunkSpace(state->memtuples));
}
/*
@@ -2944,18 +3052,15 @@ tuplesort_get_stats(Tuplesortstate *state,
* to fix. Is it worth creating an API for the memory context code to
* tell us how much is actually used in sortcontext?
*/
- if (state->tapeset)
- {
+ tuplesort_updatemax(state);
+
+ if (state->maxSpaceOnDisk)
stats->spaceType = SORT_SPACE_TYPE_DISK;
- stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- }
else
- {
stats->spaceType = SORT_SPACE_TYPE_MEMORY;
- stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
- }
+ stats->spaceUsed = (state->maxSpace + 1023) / 1024;
- switch (state->status)
+ switch (state->maxSpaceStatus)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..a9b562843d
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 2a4f7407a1..4180f57e88 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1754,6 +1754,20 @@ typedef struct MaterialState
Tuplestorestate *tuplestorestate;
} MaterialState;
+
+/* ----------------
+ * When sorting by multiple keys, the input dataset may already be
+ * presorted by some prefix of those keys. We call these "skip keys".
+ * SkipKeyData holds information about one such key.
+ * ----------------
+ */
+typedef struct SkipKeyData
+{
+ FmgrInfo flinfo; /* comparison function info */
+ FunctionCallInfoData fcinfo; /* comparison function call info */
+ OffsetNumber attno; /* attribute number in tuple */
+} SkipKeyData;
+
/* ----------------
* Shared memory container for per-worker sort information
* ----------------
@@ -1782,6 +1796,44 @@ typedef struct SortState
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
+/* ----------------
+ * Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+ TuplesortInstrumentation sinstrument;
+ int64 groupsCount;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+ int num_workers;
+ IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ * IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+ ScanState ss; /* its first field is NodeTag */
+ bool bounded; /* is the result set bounded? */
+ int64 bound; /* if bounded, how many tuples are needed */
+ bool sort_Done; /* sort completed yet? */
+ bool finished; /* are we done fetching tuples from
+ the outer node? */
+ bool bounded_Done; /* value of bounded we did the sort with */
+ int64 bound_Done; /* value of bound we did the sort with */
+ void *tuplesortstate; /* private state of tuplesort.c */
+ SkipKeyData *skipKeys; /* keys the dataset is presorted by */
+ int64 groupsCount; /* number of groups with equal skip keys */
+ TupleTableSlot *sampleSlot; /* slot for sample tuple of sort group */
+ bool am_worker; /* are we a worker? */
+ SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
/* ---------------------
* GroupState information
* ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 2eb3d6d371..b6a9d6c597 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -73,6 +73,7 @@ typedef enum NodeTag
T_HashJoin,
T_Material,
T_Sort,
+ T_IncrementalSort,
T_Group,
T_Agg,
T_WindowAgg,
@@ -125,6 +126,7 @@ typedef enum NodeTag
T_HashJoinState,
T_MaterialState,
T_SortState,
+ T_IncrementalSortState,
T_GroupState,
T_AggState,
T_WindowAggState,
@@ -240,6 +242,7 @@ typedef enum NodeTag
T_ProjectionPath,
T_ProjectSetPath,
T_SortPath,
+ T_IncrementalSortPath,
T_GroupPath,
T_UpperUniquePath,
T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 74e9fb5f7b..033ec416fe 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -750,6 +750,17 @@ typedef struct Sort
bool *nullsFirst; /* NULLS FIRST/LAST directions */
} Sort;
+
+/* ----------------
+ * incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+ Sort sort;
+ int skipCols; /* number of presorted columns */
+} IncrementalSort;
+
/* ---------------
* group node -
* Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 71689b8ed6..0d072fd7c3 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1513,6 +1513,16 @@ typedef struct SortPath
Path *subpath; /* path representing input source */
} SortPath;
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+ SortPath spath;
+ int skipCols;
+} IncrementalSortPath;
+
+
/*
* GroupPath represents grouping (of presorted input)
*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index d2fff76653..45cfbee724 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern bool enable_indexonlyscan;
extern bool enable_bitmapscan;
extern bool enable_tidscan;
extern bool enable_sort;
+extern bool enable_incrementalsort;
extern bool enable_hashagg;
extern bool enable_nestloop;
extern bool enable_material;
@@ -105,8 +106,9 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
- List *pathkeys, Cost input_cost, double tuples, int width,
- Cost comparison_cost, int sort_mem,
+ List *pathkeys, int presorted_keys,
+ Cost input_startup_cost, Cost input_total_cost,
+ double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_append(AppendPath *path);
extern void cost_merge_append(Path *path, PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 0072b7aa0d..d6b8841d33 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,7 @@ typedef enum
extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern int pathkeys_common(List *keys1, List *keys2);
extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
Relids required_outer,
CostSelector cost_criterion,
@@ -226,6 +227,7 @@ extern List *select_outer_pathkeys_for_merge(PlannerInfo *root,
extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
List *mergeclauses,
List *outer_pathkeys);
+extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
extern List *truncate_useless_pathkeys(PlannerInfo *root,
RelOptInfo *rel,
List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 299c9f846a..43e8ef20dc 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -206,6 +206,9 @@ extern void mergejoinscansel(PlannerInfo *root, Node *clause,
extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows, List **pgset);
+extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ double tuples);
+
extern void estimate_hash_bucket_stats(PlannerInfo *root,
Node *hashkey, double nbuckets,
Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 5d57c503ab..9a5b7f8d3c 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -90,7 +90,8 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
- int workMem, bool randomAccess);
+ int workMem, bool randomAccess,
+ bool skipAbbrev);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
@@ -134,6 +135,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
extern void tuplesort_end(Tuplesortstate *state);
+extern void tuplesort_reset(Tuplesortstate *state);
+
extern void tuplesort_get_stats(Tuplesortstate *state,
TuplesortInstrumentation *stats);
extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
-Sort
+Incremental Sort
Sort Key: id, data
- -> Seq Scan on test_dc
+ Presorted Key: id
+ -> Index Scan using test_dc_pkey on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index a79f891da7..0926650a0f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1517,6 +1517,7 @@ NOTICE: drop cascades to table matest1
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
SELECT thousand, tenthous FROM tenk1
@@ -1657,9 +1658,45 @@ FROM generate_series(1, 3) g(i);
{3,7,8,10,13,13,16,18,19,22}
(3 rows)
+set enable_incrementalsort = on;
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Merge Append
+ Sort Key: tenk1.thousand, tenk1.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1
+ -> Incremental Sort
+ Sort Key: tenk1_1.thousand, tenk1_1.thousand
+ Presorted Key: tenk1_1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+(7 rows)
+
+explain (costs off)
+SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+ QUERY PLAN
+-------------------------------------------------------------
+ Merge Append
+ Sort Key: a.thousand, a.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 a
+ -> Incremental Sort
+ Sort Key: b.unique2, b.unique2
+ Presorted Key: b.unique2
+ -> Index Only Scan using tenk1_unique2 on tenk1 b
+(7 rows)
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index c9c8f51e1c..898361d6b3 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_gathermerge | on
enable_hashagg | on
enable_hashjoin | on
+ enable_incrementalsort | on
enable_indexonlyscan | on
enable_indexscan | on
enable_material | on
@@ -87,7 +88,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(15 rows)
+(16 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 2e42ae115d..7229997144 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -546,6 +546,7 @@ drop table matest0 cascade;
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
@@ -607,9 +608,26 @@ SELECT
ORDER BY f.i LIMIT 10)
FROM generate_series(1, 3) g(i);
+set enable_incrementalsort = on;
+
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+
+explain (costs off)
+SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
On Mon, Jan 8, 2018 at 10:17 PM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:
I have no other questions about this patch. I expect the CFM to set the status
to "ready for committer" as soon as the other reviewers confirm they're happy
about the patch status.

Good, thank you. Let's see what other reviewers will say.
Rebased patch is attached.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
incremental-sort-16.patch (application/octet-stream)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 08b30f83e0..669fc82a75 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1997,28 +1997,62 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
119
(10 rows)
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down. For this query, the essential optimization is
+-- a top-N sort, but it can't be done remotely because we never push LIMIT
+-- down. Since the sort alone is not worth pushing down, the CROSS JOIN is
+-- also not pushed down, so that fewer tuples are transferred over the network.
EXPLAIN (VERBOSE, COSTS OFF)
-SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
- QUERY PLAN
----------------------------------------------------------------------
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+ QUERY PLAN
+------------------------------------------------------------------
Limit
- Output: t1.c1, t2.c1
+ Output: t1.c3, t2.c3
-> Sort
- Output: t1.c1, t2.c1
- Sort Key: t1.c1, t2.c1
+ Output: t1.c3, t2.c3
+ Sort Key: t1.c3, t2.c3
-> Nested Loop
- Output: t1.c1, t2.c1
+ Output: t1.c3, t2.c3
-> Foreign Scan on public.ft1 t1
- Output: t1.c1
- Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+ Output: t1.c3
+ Remote SQL: SELECT c3 FROM "S 1"."T 1"
-> Materialize
- Output: t2.c1
+ Output: t2.c3
-> Foreign Scan on public.ft2 t2
- Output: t2.c1
- Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+ Output: t2.c3
+ Remote SQL: SELECT c3 FROM "S 1"."T 1"
(15 rows)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+ c3 | c3
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
+-- CROSS JOIN, pushed down. Unlike the previous query, the remote side can
+-- return tuples in the requested order without a full sort, using an index
+-- scan plus an incremental sort. This is much cheaper than a full sort on
+-- the local side, even though the LIMIT is not known on the remote side.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+ QUERY PLAN
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+ Output: t1.c1, t2.c1
+ -> Foreign Scan
+ Output: t1.c1, t2.c1
+ Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+ Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
+
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
c1 | c1
----+-----
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 7f4d0dab25..0c55c761e9 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -511,7 +511,17 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down. For this query, the essential optimization is
+-- a top-N sort, but it can't be done remotely because we never push LIMIT
+-- down. Since the sort alone is not worth pushing down, the CROSS JOIN is
+-- also not pushed down, so that fewer tuples are transferred over the network.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, pushed down. Unlike the previous query, the remote side can
+-- return tuples in the requested order without a full sort, using an index
+-- scan plus an incremental sort. This is much cheaper than a full sort on
+-- the local side, even though the LIMIT is not known on the remote side.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 00fc364c0a..2596ebe595 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3627,6 +3627,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+ <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of incremental sort
+ steps. The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
<term><varname>enable_indexscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 900fa74e85..8246a95bfb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
ExplainState *es);
static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
- int nkeys, AttrNumber *keycols,
+ int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
@@ -1014,6 +1018,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Sort:
pname = sname = "Sort";
break;
+ case T_IncrementalSort:
+ pname = sname = "Incremental Sort";
+ break;
case T_Group:
pname = sname = "Group";
break;
@@ -1614,6 +1621,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_sort_keys(castNode(SortState, planstate), ancestors, es);
show_sort_info(castNode(SortState, planstate), es);
break;
+ case T_IncrementalSort:
+ show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ ancestors, es);
+ show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ es);
+ break;
case T_MergeAppend:
show_merge_append_keys(castNode(MergeAppendState, planstate),
ancestors, es);
@@ -1939,14 +1952,37 @@ static void
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
+ int skipCols;
+
+ if (IsA(plan, IncrementalSort))
+ skipCols = ((IncrementalSort *) plan)->skipCols;
+ else
+ skipCols = 0;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
- plan->numCols, plan->sortColIdx,
+ plan->numCols, skipCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es)
+{
+ IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+ show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ plan->sort.numCols, plan->skipCols,
+ plan->sort.sortColIdx,
+ plan->sort.sortOperators, plan->sort.collations,
+ plan->sort.nullsFirst,
+ ancestors, es);
+}
+
/*
* Likewise, for a MergeAppend node.
*/
@@ -1957,7 +1993,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
- plan->numCols, plan->sortColIdx,
+ plan->numCols, 0, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
@@ -1981,7 +2017,7 @@ show_agg_keys(AggState *astate, List *ancestors,
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
- plan->numCols, plan->grpColIdx,
+ plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
@@ -2050,7 +2086,7 @@ show_grouping_set_keys(PlanState *planstate,
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
- sortnode->numCols, sortnode->sortColIdx,
+ sortnode->numCols, 0, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
@@ -2107,7 +2143,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
- plan->numCols, plan->grpColIdx,
+ plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
@@ -2120,13 +2156,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
- int nkeys, AttrNumber *keycols,
+ int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
+ List *resultPresorted = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
@@ -2166,9 +2203,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
nullsFirst[keyno]);
/* Emit one property-list item per sort key */
result = lappend(result, pstrdup(sortkeybuf.data));
+ if (keyno < nPresortedKeys)
+ resultPresorted = lappend(resultPresorted, exprstr);
}
ExplainPropertyList(qlabel, result, es);
+ if (nPresortedKeys > 0)
+ ExplainPropertyList("Presorted Key", resultPresorted, es);
}
/*
@@ -2376,6 +2417,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
}
}
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es)
+{
+ if (es->analyze && incrsortstate->sort_Done &&
+ incrsortstate->tuplesortstate != NULL)
+ {
+ Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ TuplesortInstrumentation stats;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+
+ tuplesort_get_stats(state, &stats);
+ sortMethod = tuplesort_method_name(stats.sortMethod);
+ spaceType = tuplesort_space_type_name(stats.spaceType);
+ spaceUsed = stats.spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",
+ sortMethod, spaceType, spaceUsed);
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Groups: %ld\n",
+ incrsortstate->groupsCount);
+ }
+ else
+ {
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups",
+ incrsortstate->groupsCount, es);
+ }
+ }
+
+ if (incrsortstate->shared_info != NULL)
+ {
+ int n;
+ bool opened_group = false;
+
+ for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+ {
+ TuplesortInstrumentation *sinstrument;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+ int64 groupsCount;
+
+ sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+ groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+ if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+ continue; /* ignore any unfilled slots */
+ sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+ spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+ spaceUsed = sinstrument->spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str,
+ "Worker %d: Sort Method: %s %s: %ldkB Groups: %ld\n",
+ n, sortMethod, spaceType, spaceUsed, groupsCount);
+ }
+ else
+ {
+ if (!opened_group)
+ {
+ ExplainOpenGroup("Workers", "Workers", false, es);
+ opened_group = true;
+ }
+ ExplainOpenGroup("Worker", NULL, true, es);
+ ExplainPropertyInteger("Worker Number", n, es);
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups", groupsCount, es);
+ ExplainCloseGroup("Worker", NULL, true, es);
+ }
+ }
+ if (opened_group)
+ ExplainCloseGroup("Workers", "Workers", false, es);
+ }
+}
+
/*
* Show information on hash buckets/batches.
*/
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
- nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
- nodeValuesscan.o \
+ nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+ nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..34e05330ea 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
ExecReScanSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecReScanIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecReScanGroup((GroupState *) node);
break;
@@ -525,8 +530,12 @@ ExecSupportsBackwardScan(Plan *node)
case T_CteScan:
case T_Material:
case T_Sort:
+ /* these don't evaluate tlist */
return true;
+ case T_IncrementalSort:
+ return false;
+
case T_LockRows:
case T_Limit:
return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 14b0b89463..774cfb69d7 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
#include "executor/nodeForeignscan.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeSeqscan.h"
@@ -280,6 +281,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecSortEstimate((SortState *) planstate, e->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+ break;
default:
break;
@@ -493,6 +498,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+ break;
default:
break;
@@ -918,6 +927,10 @@ ExecParallelReInitializeDSM(PlanState *planstate,
case T_SortState:
/* these nodes have DSM state, but no reinitialization is required */
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortReInitializeDSM((IncrementalSortState *) planstate, pcxt);
+ break;
default:
break;
@@ -976,6 +989,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
case T_SortState:
ExecSortRetrieveInstrumentation((SortState *) planstate);
break;
+ case T_IncrementalSortState:
+ ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+ break;
case T_HashState:
ExecHashRetrieveInstrumentation((HashState *) planstate);
break;
@@ -1225,6 +1241,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecSortInitializeWorker((SortState *) planstate, pwcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware */
+ ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+ pwcxt);
+ break;
default:
break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 43a27a9af2..17163448a3 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
estate, eflags);
break;
+ case T_IncrementalSort:
+ result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+ estate, eflags);
+ break;
+
case T_Group:
result = (PlanState *) ExecInitGroup((Group *) node,
estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
ExecEndSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecEndIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecEndGroup((GroupState *) node);
break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 1b1334006f..77013909a8 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -373,7 +373,7 @@ initialize_phase(AggState *aggstate, int newphase)
sortnode->collations,
sortnode->nullsFirst,
work_mem,
- NULL, false);
+ NULL, false, false);
}
aggstate->current_phase = newphase;
@@ -460,7 +460,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
- work_mem, NULL, false);
+ work_mem, NULL, false, false);
}
/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..dc9e6d7cf7
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,643 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ * Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ * Incremental sort is a specially optimized kind of multikey sort used
+ * when the input is already presorted by a prefix of the required key
+ * list. Thus, when we need to sort by (key1, key2 ... keyN) and the input
+ * is already sorted by (key1, key2 ... keyM), M < N, we sort each group
+ * of tuples where the values of (key1, key2 ... keyM) are equal.
+ *
+ * Consider the following example. We have input tuples consisting of
+ * two integers (x, y), already presorted by x, while we need to sort
+ * them by x and y. Let the input tuples be the following.
+ *
+ * (1, 5)
+ * (1, 2)
+ * (2, 10)
+ * (2, 1)
+ * (2, 5)
+ * (3, 3)
+ * (3, 7)
+ *
+ * The incremental sort algorithm individually sorts by y each of the
+ * following groups, which have equal x:
+ * (1, 5) (1, 2)
+ * (2, 10) (2, 1) (2, 5)
+ * (3, 3) (3, 7)
+ *
+ * After sorting these groups and concatenating them, we get the
+ * following tuple set, which is sorted by x and y.
+ *
+ * (1, 2)
+ * (1, 5)
+ * (2, 1)
+ * (2, 5)
+ * (2, 10)
+ * (3, 3)
+ * (3, 7)
+ *
+ * Incremental sort is faster than a full sort on large datasets. But
+ * the biggest benefit of incremental sort comes with queries using
+ * LIMIT, because incremental sort can return the first tuples without
+ * reading the whole input dataset.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for skipKeys comparison.
+ */
+static void
+prepareSkipCols(IncrementalSortState *node)
+{
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ int skipCols,
+ i;
+
+ Assert(IsA(plannode, IncrementalSort));
+ skipCols = plannode->skipCols;
+
+ node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+ for (i = 0; i < skipCols; i++)
+ {
+ Oid equalityOp, equalityFunc;
+ SkipKeyData *key;
+
+ key = &node->skipKeys[i];
+ key->attno = plannode->sort.sortColIdx[i];
+
+ equalityOp = get_equality_op_for_ordering_op(
+ plannode->sort.sortOperators[i], NULL);
+ if (!OidIsValid(equalityOp))
+ elog(ERROR, "missing equality operator for ordering operator %u",
+ plannode->sort.sortOperators[i]);
+
+ equalityFunc = get_opcode(equalityOp);
+ if (!OidIsValid(equalityFunc))
+ elog(ERROR, "missing function for operator %u", equalityOp);
+
+ /* Lookup the comparison function */
+ fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+ /* We can initialize the callinfo just once and re-use it */
+ InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ plannode->sort.collations[i], NULL, NULL);
+ key->fcinfo.argnull[0] = false;
+ key->fcinfo.argnull[1] = false;
+ }
+}
+
+/*
+ * Check if first "skipCols" sort values are equal.
+ */
+static bool
+cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ TupleTableSlot *b)
+{
+ int n, i;
+
+ Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+ n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+ for (i = 0; i < n; i++)
+ {
+ Datum datumA, datumB, result;
+ bool isnullA, isnullB;
+ AttrNumber attno = node->skipKeys[i].attno;
+ SkipKeyData *key;
+
+ datumA = slot_getattr(a, attno, &isnullA);
+ datumB = slot_getattr(b, attno, &isnullB);
+
+ /* Special case for NULL-vs-NULL, else use standard comparison */
+ if (isnullA || isnullB)
+ {
+ if (isnullA == isnullB)
+ continue;
+ else
+ return false;
+ }
+
+ key = &node->skipKeys[i];
+
+ key->fcinfo.arg[0] = datumA;
+ key->fcinfo.arg[1] = datumB;
+
+ /* just for paranoia's sake, we reset isnull each time */
+ key->fcinfo.isnull = false;
+
+ result = FunctionCallInvoke(&key->fcinfo);
+
+ /* Check for null result, since caller is clearly not expecting one */
+ if (key->fcinfo.isnull)
+ elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+ if (!DatumGetBool(result))
+ return false;
+ }
+ return true;
+}
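+
+/* Reviewer's note, not part of the patch: cmpSortSkipCols above treats NULL
+ * as equal to NULL (so NULL prefix values stay within one group) and unequal
+ * to anything else. A standalone sketch of that comparison rule, using a
+ * nullable-int row in place of tuple slots (all names here are illustrative):
+ *
+ *   #include <stdbool.h>
+ *   #include <stdio.h>
+ *
+ *   typedef struct { bool isnull; int value; } Col;
+ *
+ *   static bool
+ *   skip_cols_equal(const Col *a, const Col *b, int n)
+ *   {
+ *       for (int i = 0; i < n; i++)
+ *       {
+ *           if (a[i].isnull || b[i].isnull)
+ *           {
+ *               if (a[i].isnull == b[i].isnull)
+ *                   continue;   // NULL vs NULL: still the same group
+ *               return false;   // NULL vs non-NULL: different group
+ *           }
+ *           if (a[i].value != b[i].value)
+ *               return false;
+ *       }
+ *       return true;
+ *   }
+ *
+ *   int
+ *   main(void)
+ *   {
+ *       Col r1[] = {{false, 1}, {true, 0}};
+ *       Col r2[] = {{false, 1}, {true, 0}};
+ *       Col r3[] = {{false, 1}, {false, 0}};
+ *
+ *       printf("%d %d\n", skip_cols_equal(r1, r2, 2), skip_cols_equal(r1, r3, 2));
+ *       return 0;
+ *   }
+ */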
+
+/*
+ * Copying tuples to node->sampleSlot introduces some overhead, which is
+ * especially noticeable when groups contain only one or a few tuples. To
+ * cope with this problem, we don't copy the sample tuple until the group
+ * contains at least MIN_GROUP_SIZE tuples. This might reduce the efficiency
+ * of incremental sort, but it reduces the probability of a regression.
+#define MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSort
+ *
+ * Assuming that the outer subtree returns tuples presorted by some
+ * prefix of the target sort columns, performs an incremental sort. It
+ * fetches groups of tuples whose prefix sort columns are equal and
+ * sorts them using tuplesort. This approach avoids sorting the whole
+ * dataset. Besides taking less memory and being faster, it can start
+ * returning tuples before the full dataset has been fetched from the
+ * outer subtree.
+ *
+ * Conditions:
+ * -- none.
+ *
+ * Initial States:
+ * -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+ IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+ EState *estate;
+ ScanDirection dir;
+ Tuplesortstate *tuplesortstate;
+ TupleTableSlot *slot;
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ PlanState *outerNode;
+ TupleDesc tupDesc;
+ int64 nTuples = 0;
+
+ /*
+ * get state info from node
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "entering routine");
+
+ estate = node->ss.ps.state;
+ dir = estate->es_direction;
+ tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+ /*
+ * Return next tuple from sorted set if any.
+ */
+ if (node->sort_Done)
+ {
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ if (tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL) || node->finished)
+ return slot;
+ }
+
+ /*
+ * If first time through, read all tuples from outer plan and pass them to
+ * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
+ */
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "sorting subplan");
+
+ /*
+ * Want to scan subplan in the forward direction while creating the
+ * sorted data.
+ */
+ estate->es_direction = ForwardScanDirection;
+
+ /*
+ * Initialize tuplesort module.
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "calling tuplesort_begin");
+
+ outerNode = outerPlanState(node);
+ tupDesc = ExecGetResultType(outerNode);
+
+ if (node->tuplesortstate == NULL)
+ {
+ /*
+ * We are going to process the first group of presorted data.
+ * Initialize the support structures for cmpSortSkipCols(), which
+ * compares the already-sorted columns.
+ */
+ prepareSkipCols(node);
+
+ /*
+ * Pass all the columns to tuplesort. We pass groups of at least
+ * MIN_GROUP_SIZE tuples to the tuplesort, so these groups don't
+ * necessarily have equal values of the first column. Incremental
+ * sort is unlikely to produce huge groups, so using abbreviated
+ * keys would likely be a waste of time.
+ */
+ tuplesortstate = tuplesort_begin_heap(
+ tupDesc,
+ plannode->sort.numCols,
+ plannode->sort.sortColIdx,
+ plannode->sort.sortOperators,
+ plannode->sort.collations,
+ plannode->sort.nullsFirst,
+ work_mem,
+ NULL,
+ false,
+ true);
+ node->tuplesortstate = (void *) tuplesortstate;
+ }
+ else
+ {
+ /* Next group of presorted data */
+ tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ }
+ node->groupsCount++;
+
+ /* Calculate remaining bound for bounded sort */
+ if (node->bounded)
+ tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+ /* Put the saved tuple, if any, into the tuplesort */
+ if (!TupIsNull(node->sampleSlot))
+ {
+ tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+ ExecClearTuple(node->sampleSlot);
+ nTuples++;
+ }
+
+ /*
+ * Put into the tuplesort the next group of tuples, i.e. those whose
+ * skipCols sort values are all equal.
+ */
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+
+ /* Put next group of presorted data to the tuplesort */
+ if (nTuples < MIN_GROUP_SIZE)
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+
+ /* Save last tuple in minimal group */
+ if (nTuples == MIN_GROUP_SIZE - 1)
+ ExecCopySlot(node->sampleSlot, slot);
+ nTuples++;
+ }
+ else
+ {
+ /* Iterate while skip cols are the same as in saved tuple */
+ if (cmpSortSkipCols(node, node->sampleSlot, slot))
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ nTuples++;
+ }
+ else
+ {
+ ExecCopySlot(node->sampleSlot, slot);
+ break;
+ }
+ }
+ }
+
+ /*
+ * Complete the sort.
+ */
+ tuplesort_performsort(tuplesortstate);
+
+ /*
+ * restore to user specified direction
+ */
+ estate->es_direction = dir;
+
+ /*
+ * finally set the sorted flag to true
+ */
+ node->sort_Done = true;
+ node->bounded_Done = node->bounded;
+ if (node->shared_info && node->am_worker)
+ {
+ TuplesortInstrumentation *si;
+
+ Assert(IsParallelWorker());
+ Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+ si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+ tuplesort_get_stats(tuplesortstate, si);
+ node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+ node->groupsCount;
+ }
+
+ /*
+ * Adjust bound_Done with number of tuples we've actually sorted.
+ */
+ if (node->bounded)
+ {
+ if (node->finished)
+ node->bound_Done = node->bound;
+ else
+ node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ }
+
+ SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "retrieving tuple from tuplesort");
+
+ /*
+ * Get the first or next tuple from tuplesort. Returns NULL if no more
+ * tuples.
+ */
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+ return slot;
+}
+
+/* ----------------------------------------------------------------
+ * ExecInitIncrementalSort
+ *
+ * Creates the run-time state information for the sort node
+ * produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+ IncrementalSortState *incrsortstate;
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "initializing sort node");
+
+ /*
+ * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only hold the
+ * current group in the tuplesortstate.
+ */
+ Assert((eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) == 0);
+
+ /*
+ * create state structure
+ */
+ incrsortstate = makeNode(IncrementalSortState);
+ incrsortstate->ss.ps.plan = (Plan *) node;
+ incrsortstate->ss.ps.state = estate;
+ incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+ incrsortstate->bounded = false;
+ incrsortstate->sort_Done = false;
+ incrsortstate->finished = false;
+ incrsortstate->tuplesortstate = NULL;
+ incrsortstate->sampleSlot = NULL;
+ incrsortstate->bound_Done = 0;
+ incrsortstate->groupsCount = 0;
+ incrsortstate->skipKeys = NULL;
+
+ /*
+ * Miscellaneous initialization
+ *
+ * Sort nodes don't initialize their ExprContexts because they never call
+ * ExecQual or ExecProject.
+ */
+
+ /*
+ * initialize child nodes
+ *
+ * We shield the child node from the need to support REWIND, BACKWARD, or
+ * MARK/RESTORE.
+ */
+ eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+ outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+ /*
+ * Initialize scan slot and type.
+ */
+ ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss);
+
+ /*
+ * Initialize return slot and type. No need to initialize projection info because
+ * this node doesn't do projections.
+ */
+ ExecInitResultTupleSlotTL(estate, &incrsortstate->ss.ps);
+ incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+ /* make standalone slot to store previous tuple from outer node */
+ incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+ ExecGetResultType(outerPlanState(incrsortstate)));
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "sort node initialized");
+
+ return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ * ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "shutting down sort node");
+
+ /*
+ * clean out the tuple table
+ */
+ ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ /* must drop standalone tuple slot from outer node */
+ ExecDropSingleTupleTableSlot(node->sampleSlot);
+
+ /*
+ * Release tuplesort resources
+ */
+ if (node->tuplesortstate != NULL)
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+
+ /*
+ * shut down the subplan
+ */
+ ExecEndNode(outerPlanState(node));
+
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+ PlanState *outerPlan = outerPlanState(node);
+
+ /*
+ * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ * re-scan it at all.
+ */
+ if (!node->sort_Done)
+ return;
+
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+ /*
+ * Incremental sort doesn't support random access to the sorted data,
+ * so we always forget the previous sort results: we have to re-read
+ * the subplan and re-sort.
+ */
+ node->sort_Done = false;
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+ node->bound_Done = 0;
+
+ /*
+ * if chgParam of subnode is not null then plan will be re-scanned by
+ * first ExecProcNode.
+ */
+ if (outerPlan->chgParam == NULL)
+ ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ * Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortEstimate
+ *
+ * Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+ size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+ shm_toc_estimate_chunk(&pcxt->estimator, size);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortInitializeDSM
+ *
+ * Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + pcxt->nworkers * sizeof(IncrementalSortInfo);
+ node->shared_info = shm_toc_allocate(pcxt->toc, size);
+ /* ensure any unfilled slots will contain zeroes */
+ memset(node->shared_info, 0, size);
+ node->shared_info->num_workers = pcxt->nworkers;
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+ node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+ /* If there's any instrumentation space, clear it for next time */
+ if (node->shared_info != NULL)
+ {
+ memset(node->shared_info->sinfo, 0,
+ node->shared_info->num_workers * sizeof(IncrementalSortInfo));
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortInitializeWorker
+ *
+ * Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+ node->shared_info =
+ shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+ node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortRetrieveInstrumentation
+ *
+ * Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+ Size size;
+ SharedIncrementalSortInfo *si;
+
+ if (node->shared_info == NULL)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+ si = palloc(size);
+ memcpy(si, node->shared_info, size);
+ node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73f16c9aba..457e774b3d 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,9 @@ ExecSort(PlanState *pstate)
plannode->collations,
plannode->nullsFirst,
work_mem,
- NULL, node->randomAccess);
+ NULL,
+ node->randomAccess,
+ false);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 266a3ef8ef..0c9862da75 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -920,6 +920,24 @@ _copyMaterial(const Material *from)
}
+/*
+ * CopySortFields
+ *
+ * This function copies the fields of the Sort node. It is used by
+ * all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+ CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+ COPY_SCALAR_FIELD(numCols);
+ COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
/*
* _copySort
*/
@@ -931,13 +949,29 @@ _copySort(const Sort *from)
/*
* copy node superclass fields
*/
- CopyPlanFields((const Plan *) from, (Plan *) newnode);
+ CopySortFields(from, newnode);
- COPY_SCALAR_FIELD(numCols);
- COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
- COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
- COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
- COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+ IncrementalSort *newnode = makeNode(IncrementalSort);
+
+ /*
+ * copy node superclass fields
+ */
+ CopySortFields((const Sort *) from, (Sort *) newnode);
+
+ /*
+ * copy remainder of node
+ */
+ COPY_SCALAR_FIELD(skipCols);
return newnode;
}
@@ -4831,6 +4865,9 @@ copyObjectImpl(const void *from)
case T_Sort:
retval = _copySort(from);
break;
+ case T_IncrementalSort:
+ retval = _copyIncrementalSort(from);
+ break;
case T_Group:
retval = _copyGroup(from);
break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 011d2a3fa9..116dcc937f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -876,12 +876,10 @@ _outMaterial(StringInfo str, const Material *node)
}
static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
{
int i;
- WRITE_NODE_TYPE("SORT");
-
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
@@ -903,6 +901,24 @@ _outSort(StringInfo str, const Sort *node)
appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
}
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+ WRITE_NODE_TYPE("SORT");
+
+ _outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+ WRITE_NODE_TYPE("INCREMENTALSORT");
+
+ _outSortInfo(str, (const Sort *) node);
+
+ WRITE_INT_FIELD(skipCols);
+}
+
static void
_outUnique(StringInfo str, const Unique *node)
{
@@ -3754,6 +3770,9 @@ outNode(StringInfo str, const void *obj)
case T_Sort:
_outSort(str, obj);
break;
+ case T_IncrementalSort:
+ _outIncrementalSort(str, obj);
+ break;
case T_Unique:
_outUnique(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 068db353d7..ddb658b5df 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2066,12 +2066,13 @@ _readMaterial(void)
}
/*
- * _readSort
+ * ReadCommonSort
+ * Assign the basic stuff of all nodes that inherit from Sort
*/
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
{
- READ_LOCALS(Sort);
+ READ_TEMP_LOCALS();
ReadCommonPlan(&local_node->plan);
@@ -2080,6 +2081,32 @@ _readSort(void)
READ_OID_ARRAY(sortOperators, local_node->numCols);
READ_OID_ARRAY(collations, local_node->numCols);
READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+ READ_LOCALS_NO_FIELDS(Sort);
+
+ ReadCommonSort(local_node);
+
+ READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+ READ_LOCALS(IncrementalSort);
+
+ ReadCommonSort(&local_node->sort);
+
+ READ_INT_FIELD(skipCols);
READ_DONE();
}
@@ -2647,6 +2674,8 @@ parseNodeString(void)
return_value = _readMaterial();
else if (MATCH("SORT", 4))
return_value = _readSort();
+ else if (MATCH("INCREMENTALSORT", 15))
+ return_value = _readIncrementalSort();
else if (MATCH("GROUP", 5))
return_value = _readGroup();
else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 1c792a00eb..c546dc8862 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3624,6 +3624,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
ptype = "Sort";
subpath = ((SortPath *) path)->subpath;
break;
+ case T_IncrementalSortPath:
+ ptype = "IncrementalSort";
+ subpath = ((SortPath *) path)->subpath;
+ break;
case T_GroupPath:
ptype = "Group";
subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index d8db0b29e1..730e69f313 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool enable_indexonlyscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
+bool enable_incrementalsort = true;
bool enable_hashagg = true;
bool enable_nestloop = true;
bool enable_material = true;
@@ -1614,6 +1615,13 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
+ * The sort can be either a full sort of the relation or an incremental sort
+ * when the data is already presorted by some prefix of the required pathkeys.
+ * In the latter case we estimate the number of groups the presorted pathkeys
+ * divide the source data into, and then estimate the cost of sorting each
+ * individual group, assuming the data is divided among the groups uniformly.
+ * Also, if a LIMIT is specified, then we only have to pull from the source
+ * and sort some of the groups.
+ *
* If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
@@ -1640,7 +1648,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
+ * 'presorted_keys' is the number of pathkeys already presorted in the given path
+ * 'input_startup_cost' is the startup cost for reading the input data
+ * 'input_total_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
@@ -1656,19 +1666,28 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
*/
void
cost_sort(Path *path, PlannerInfo *root,
- List *pathkeys, Cost input_cost, double tuples, int width,
- Cost comparison_cost, int sort_mem,
+ List *pathkeys, int presorted_keys,
+ Cost input_startup_cost, Cost input_total_cost,
+ double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples)
{
- Cost startup_cost = input_cost;
- Cost run_cost = 0;
+ Cost startup_cost = input_startup_cost;
+ Cost run_cost = 0,
+ rest_cost,
+ group_cost,
+ input_run_cost = input_total_cost - input_startup_cost;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
+ double num_groups,
+ group_input_bytes,
+ group_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
+ if (!enable_incrementalsort)
+ presorted_keys = 0;
path->rows = tuples;
@@ -1694,13 +1713,50 @@ cost_sort(Path *path, PlannerInfo *root,
output_bytes = input_bytes;
}
- if (output_bytes > sort_mem_bytes)
+ /*
+ * Estimate the number of groups the presorted keys divide the dataset into.
+ */
+ if (presorted_keys > 0)
+ {
+ List *presortedExprs = NIL;
+ ListCell *l;
+ int i = 0;
+
+ /* Extract presorted keys as list of expressions */
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ presortedExprs = lappend(presortedExprs, member->em_expr);
+
+ i++;
+ if (i >= presorted_keys)
+ break;
+ }
+
+ /* Estimate number of groups with equal presorted keys */
+ num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
+ }
+ else
+ {
+ num_groups = 1.0;
+ }
+
+ /*
+ * Estimate average cost of sorting of one group where presorted keys are
+ * equal.
+ */
+ group_input_bytes = input_bytes / num_groups;
+ group_tuples = tuples / num_groups;
+ if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
- double npages = ceil(input_bytes / BLCKSZ);
- double nruns = input_bytes / sort_mem_bytes;
+ double npages = ceil(group_input_bytes / BLCKSZ);
+ double nruns = group_input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
@@ -1710,7 +1766,7 @@ cost_sort(Path *path, PlannerInfo *root,
*
* Assume about N log2 N comparisons
*/
- startup_cost += comparison_cost * tuples * LOG2(tuples);
+ group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
/* Disk costs */
@@ -1721,10 +1777,10 @@ cost_sort(Path *path, PlannerInfo *root,
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
- startup_cost += npageaccesses *
+ group_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
- else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
+ else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
@@ -1732,14 +1788,33 @@ cost_sort(Path *path, PlannerInfo *root,
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
- startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+ group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
}
else
{
- /* We'll use plain quicksort on all the input tuples */
- startup_cost += comparison_cost * tuples * LOG2(tuples);
+ /*
+ * We'll use plain quicksort on all the input tuples. If we expect
+ * fewer than two tuples per sort group, assume the logarithmic part
+ * of the estimate to be 1.
+ */
+ if (group_tuples >= 2.0)
+ group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
+ else
+ group_cost = comparison_cost * group_tuples;
}
+ /* Add per group cost of fetching tuples from input */
+ group_cost += input_run_cost / num_groups;
+
+ /*
+ * We have to sort the first group before the node can start returning
+ * output. Sorting the rest of the groups is required to return all the
+ * other tuples.
+ */
+ startup_cost += group_cost;
+ rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ if (rest_cost > 0.0)
+ run_cost += rest_cost;
+
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
@@ -1750,6 +1825,20 @@ cost_sort(Path *path, PlannerInfo *root,
*/
run_cost += cpu_operator_cost * tuples;
+ /* Extra costs of incremental sort */
+ if (presorted_keys > 0)
+ {
+ /*
+ * In incremental sort case we also have to cost the detection of
+ * sort groups. This turns out to be one extra copy and comparison
+ * per tuple.
+ */
+ run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+ /* Cost of per group tuplesort reset */
+ run_cost += 2.0 * cpu_tuple_cost * num_groups;
+ }
+
path->startup_cost = startup_cost;
path->total_cost = startup_cost + run_cost;
}
@@ -2727,6 +2816,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
cost_sort(&sort_path,
root,
outersortkeys,
+ pathkeys_common(outer_path->pathkeys, outersortkeys),
+ outer_path->startup_cost,
outer_path->total_cost,
outer_path_rows,
outer_path->pathtarget->width,
@@ -2753,6 +2844,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
cost_sort(&sort_path,
root,
innersortkeys,
+ pathkeys_common(inner_path->pathkeys, innersortkeys),
+ inner_path->startup_cost,
inner_path->total_cost,
inner_path_rows,
inner_path->pathtarget->width,
@@ -2989,18 +3082,17 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
* inner path is to be used directly (without sorting) and it doesn't
* support mark/restore.
*
- * Since the inner side must be ordered, and only Sorts and IndexScans can
- * create order to begin with, and they both support mark/restore, you
- * might think there's no problem --- but you'd be wrong. Nestloop and
- * merge joins can *preserve* the order of their inputs, so they can be
- * selected as the input of a mergejoin, and they don't support
- * mark/restore at present.
+ * Sorts and IndexScans support mark/restore, but IncrementalSorts don't.
+ * Also Nestloop and merge joins can *preserve* the order of their inputs,
+ * so they can be selected as the input of a mergejoin, and they don't
+ * support mark/restore at present.
*
* We don't test the value of enable_material here, because
* materialization is required for correctness in this case, and turning
* it off does not entitle us to deliver an invalid plan.
*/
- else if (innersortkeys == NIL &&
+ else if ((innersortkeys == NIL ||
+ pathkeys_common(innersortkeys, inner_path->pathkeys) > 0) &&
!ExecSupportsMarkRestore(inner_path))
path->materialize_inner = true;
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 6d1cc3b8a0..cf980ac590 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -22,10 +22,12 @@
#include "nodes/nodeFuncs.h"
#include "nodes/plannodes.h"
#include "optimizer/clauses.h"
+#include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/tlist.h"
#include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -308,6 +310,33 @@ compare_pathkeys(List *keys1, List *keys2)
return PATHKEYS_EQUAL;
}
+
+/*
+ * pathkeys_common
+ * Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+ int n;
+ ListCell *key1,
+ *key2;
+ n = 0;
+
+ forboth(key1, keys1, key2, keys2)
+ {
+ PathKey *pathkey1 = (PathKey *) lfirst(key1);
+ PathKey *pathkey2 = (PathKey *) lfirst(key2);
+
+ if (pathkey1 != pathkey2)
+ return n;
+ n++;
+ }
+
+ return n;
+}
+
+
/*
* pathkeys_contained_in
* Common special case of compare_pathkeys: we just want to know
@@ -1580,26 +1609,42 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Returns the number of pathkeys that match the given argument. The
+ * remainder can be satisfied by an incremental sort.
*/
-static int
-pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
+int
+pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
{
- if (root->query_pathkeys == NIL)
+ int n_common_pathkeys;
+
+ if (query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
- if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
+ n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
+
+ if (enable_incrementalsort)
{
- /* It's useful ... or at least the first N keys are */
- return list_length(root->query_pathkeys);
+ /*
+ * Return the number of pathkeys in common, or 0 if there are none.
+ * Any leading common pathkeys can be useful for ordering because we
+ * can use incremental sort for the remainder.
+ */
+ return n_common_pathkeys;
+ }
+ else
+ {
+ /*
+ * When incremental sort is disabled, pathkeys are useful only when they
+ * do contain all the query pathkeys.
+ */
+ if (n_common_pathkeys == list_length(query_pathkeys))
+ return n_common_pathkeys;
+ else
+ return 0;
}
-
- return 0; /* path ordering not useful */
}
/*
@@ -1615,7 +1660,7 @@ truncate_useless_pathkeys(PlannerInfo *root,
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
- nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
+ nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 9ae1bf31d5..e7529b6c04 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -236,7 +236,7 @@ static MergeJoin *make_mergejoin(List *tlist,
Plan *lefttree, Plan *righttree,
JoinType jointype, bool inner_unique,
bool skip_mark_restore);
-static Sort *make_sort(Plan *lefttree, int numCols,
+static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
@@ -252,10 +252,11 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
- Relids relids);
+ Relids relids, int skipCols);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
- Plan *lefttree);
+ Plan *lefttree,
+ int skipCols);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
@@ -443,6 +444,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
(GatherPath *) best_path);
break;
case T_Sort:
+ case T_IncrementalSort:
plan = (Plan *) create_sort_plan(root,
(SortPath *) best_path,
flags);
@@ -1128,6 +1130,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int n_common_pathkeys;
/* Build the child plan */
/* Must insist that all children return the same tlist */
@@ -1162,9 +1165,11 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
- if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+ if (n_common_pathkeys < list_length(pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
+ n_common_pathkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
@@ -1514,6 +1519,7 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
Plan *subplan;
List *pathkeys = best_path->path.pathkeys;
List *tlist = build_path_tlist(root, &best_path->path);
+ int n_common_pathkeys;
/* As with Gather, it's best to project away columns in the workers. */
subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
@@ -1543,12 +1549,16 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
- if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
+ n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
+ if (n_common_pathkeys < list_length(pathkeys))
+ {
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ n_common_pathkeys,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
+ }
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
@@ -1661,6 +1671,7 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
{
Sort *plan;
Plan *subplan;
+ int n_common_pathkeys;
/*
* We don't want any excess columns in the sorted tuples, so request a
@@ -1670,7 +1681,13 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
- plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+ if (IsA(best_path, IncrementalSortPath))
+ n_common_pathkeys = ((IncrementalSortPath *) best_path)->skipCols;
+ else
+ n_common_pathkeys = 0;
+
+ plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+ NULL, n_common_pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -1914,7 +1931,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
- subplan);
+ subplan,
+ 0);
}
if (!rollup->is_hashed)
@@ -3862,10 +3880,15 @@ create_mergejoin_plan(PlannerInfo *root,
*/
if (best_path->outersortkeys)
{
+ Sort *sort;
+ int n_common_pathkeys;
Relids outer_relids = outer_path->parent->relids;
- Sort *sort = make_sort_from_pathkeys(outer_plan,
- best_path->outersortkeys,
- outer_relids);
+
+ n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
+ best_path->jpath.outerjoinpath->pathkeys);
+
+ sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
+ outer_relids, n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
@@ -3876,10 +3899,15 @@ create_mergejoin_plan(PlannerInfo *root,
if (best_path->innersortkeys)
{
+ Sort *sort;
+ int n_common_pathkeys;
Relids inner_relids = inner_path->parent->relids;
- Sort *sort = make_sort_from_pathkeys(inner_plan,
- best_path->innersortkeys,
- inner_relids);
+
+ n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
+ best_path->jpath.innerjoinpath->pathkeys);
+
+ sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
+ inner_relids, n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
@@ -4934,8 +4962,13 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
{
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
+ int skip_cols = 0;
+
+ if (IsA(plan, IncrementalSort))
+ skip_cols = ((IncrementalSort *) plan)->skipCols;
- cost_sort(&sort_path, root, NIL,
+ cost_sort(&sort_path, root, NIL, skip_cols,
+ lefttree->startup_cost,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
@@ -5526,13 +5559,31 @@ make_mergejoin(List *tlist,
* nullsFirst arrays already.
*/
static Sort *
-make_sort(Plan *lefttree, int numCols,
+make_sort(Plan *lefttree, int numCols, int skipCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
- Sort *node = makeNode(Sort);
- Plan *plan = &node->plan;
+ Sort *node;
+ Plan *plan;
+
+ /* Always use a regular Sort node when enable_incrementalsort is off */
+ if (!enable_incrementalsort)
+ skipCols = 0;
+
+ if (skipCols == 0)
+ {
+ node = makeNode(Sort);
+ }
+ else
+ {
+ IncrementalSort *incrementalSort;
+
+ incrementalSort = makeNode(IncrementalSort);
+ node = &incrementalSort->sort;
+ incrementalSort->skipCols = skipCols;
+ }
+ plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
@@ -5865,9 +5916,11 @@ find_ec_member_for_tle(EquivalenceClass *ec,
* 'lefttree' is the node which yields input tuples
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
* 'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ * 'skipCols' is the number of presorted columns in input tuples
*/
static Sort *
-make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
+make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
+ Relids relids, int skipCols)
{
int numsortkeys;
AttrNumber *sortColIdx;
@@ -5887,7 +5940,7 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
&nullsFirst);
/* Now build the Sort node */
- return make_sort(lefttree, numsortkeys,
+ return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
@@ -5930,7 +5983,7 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
numsortkeys++;
}
- return make_sort(lefttree, numsortkeys,
+ return make_sort(lefttree, numsortkeys, 0,
sortColIdx, sortOperators,
collations, nullsFirst);
}
@@ -5951,7 +6004,8 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
- Plan *lefttree)
+ Plan *lefttree,
+ int skipCols)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
@@ -5984,7 +6038,7 @@ make_sort_from_groupcols(List *groupcls,
numsortkeys++;
}
- return make_sort(lefttree, numsortkeys,
+ return make_sort(lefttree, numsortkeys, skipCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
@@ -6649,6 +6703,7 @@ is_projection_capable_plan(Plan *plan)
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 95cbffbd69..308f60beac 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -44,6 +44,7 @@
#include "parser/parse_clause.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
#include "utils/syscache.h"
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index de1257d9c2..496024cb16 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4650,13 +4650,13 @@ create_ordered_paths(PlannerInfo *root,
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
- bool is_sorted;
+ int n_useful_pathkeys;
- is_sorted = pathkeys_contained_in(root->sort_pathkeys,
- path->pathkeys);
- if (path == cheapest_input_path || is_sorted)
+ n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
+ path->pathkeys);
+ if (path == cheapest_input_path || n_useful_pathkeys > 0)
{
- if (!is_sorted)
+ if (n_useful_pathkeys < list_length(root->sort_pathkeys))
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
@@ -5786,8 +5786,9 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
- cost_sort(&seqScanAndSortPath, root, NIL,
- seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
+ cost_sort(&seqScanAndSortPath, root, NIL, 0,
+ seqScanPath->startup_cost, seqScanPath->total_cost,
+ rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
@@ -6023,14 +6024,14 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
- bool is_sorted;
+ int n_useful_pathkeys;
- is_sorted = pathkeys_contained_in(root->group_pathkeys,
- path->pathkeys);
- if (path == cheapest_path || is_sorted)
+ n_useful_pathkeys = pathkeys_useful_for_ordering(
+ root->group_pathkeys, path->pathkeys);
+ if (path == cheapest_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest-total path if it isn't already sorted */
- if (!is_sorted)
+ if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
@@ -6092,21 +6093,24 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
foreach(lc, partially_grouped_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
+ int n_useful_pathkeys;
/*
* Insert a Sort node, if required. But there's no point in
- * sorting anything but the cheapest path.
+ * sorting (non-incrementally) anything but the cheapest path.
*/
- if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
- {
- if (path != partially_grouped_rel->cheapest_total_path)
- continue;
+ n_useful_pathkeys = pathkeys_useful_for_ordering(
+ root->group_pathkeys, path->pathkeys);
+ if (n_useful_pathkeys == 0 &&
+ path != partially_grouped_rel->cheapest_total_path)
+ continue;
+
+ if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
root->group_pathkeys,
-1.0);
- }
if (parse->hasAggs)
add_path(grouped_rel, (Path *)
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 4617d12cb9..be520e6086 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 46367cba63..616ad1a474 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2782,6 +2782,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index b586f941a8..3bce376e38 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -987,7 +987,8 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
- cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
+ cost_sort(&sorted_p, root, NIL, 0,
+ sorted_p.startup_cost, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index fe3b4582d4..b411a70015 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -105,7 +105,7 @@ compare_path_costs(Path *path1, Path *path2, CostSelector criterion)
}
/*
- * compare_path_fractional_costs
+ * compare_fractional_path_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
@@ -1362,12 +1362,13 @@ create_merge_append_path(PlannerInfo *root,
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ int n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
- if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
@@ -1381,6 +1382,8 @@ create_merge_append_path(PlannerInfo *root,
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->parent->tuples,
subpath->pathtarget->width,
@@ -1628,7 +1631,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
/*
* Estimate cost for sort+unique implementation
*/
- cost_sort(&sort_path, root, NIL,
+ cost_sort(&sort_path, root, NIL, 0,
+ subpath->startup_cost,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
@@ -1721,6 +1725,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
GatherMergePath *pathnode = makeNode(GatherMergePath);
Cost input_startup_cost = 0;
Cost input_total_cost = 0;
+ int n_common_pathkeys;
Assert(subpath->parallel_safe);
Assert(pathkeys);
@@ -1737,7 +1742,9 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
- if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+
+ if (n_common_pathkeys == list_length(pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
@@ -1751,6 +1758,8 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
@@ -2610,9 +2619,31 @@ create_sort_path(PlannerInfo *root,
List *pathkeys,
double limit_tuples)
{
- SortPath *pathnode = makeNode(SortPath);
+ SortPath *pathnode;
+ int n_common_pathkeys;
+
+ if (enable_incrementalsort)
+ n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+ else
+ n_common_pathkeys = 0;
+
+ if (n_common_pathkeys == 0)
+ {
+ pathnode = makeNode(SortPath);
+ pathnode->path.pathtype = T_Sort;
+ }
+ else
+ {
+ IncrementalSortPath *incpathnode;
+
+ incpathnode = makeNode(IncrementalSortPath);
+ pathnode = &incpathnode->spath;
+ pathnode->path.pathtype = T_IncrementalSort;
+ incpathnode->skipCols = n_common_pathkeys;
+ }
+
+ Assert(n_common_pathkeys < list_length(pathkeys));
- pathnode->path.pathtype = T_Sort;
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
@@ -2626,7 +2657,9 @@ create_sort_path(PlannerInfo *root,
pathnode->subpath = subpath;
- cost_sort(&pathnode->path, root, pathkeys,
+ cost_sort(&pathnode->path, root,
+ pathkeys, n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
@@ -2938,7 +2971,8 @@ create_groupingsets_path(PlannerInfo *root,
else
{
/* Account for cost of sort, but don't charge input cost again */
- cost_sort(&sort_path, root, NIL,
+ cost_sort(&sort_path, root, NIL, 0,
+ 0.0,
0.0,
subpath->rows,
subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 50b34fcbc6..0b5ce4be45 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -295,7 +295,8 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
qstate->sortNullsFirsts,
work_mem,
NULL,
- qstate->rescan_needed);
+ qstate->rescan_needed,
+ false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fcc8323f62..4726bee850 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3714,6 +3714,42 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
return numdistinct;
}
+/*
+ * estimate_pathkeys_groups - Estimate the number of groups the dataset is
+ * divided into by the given pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * that the first i pathkeys divide the dataset into.  This is a convenience
+ * wrapper over estimate_num_groups().
+ */
+double *
+estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+{
+ ListCell *l;
+ List *groupExprs = NIL;
+ double *result;
+ int i;
+
+ /*
+ * Get number of groups for each prefix of pathkeys.
+ */
+ i = 0;
+ result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ groupExprs = lappend(groupExprs, member->em_expr);
+
+ result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ i++;
+ }
+
+ return result;
+}
+
/*
* Estimate hash bucket statistics when the specified expression is used
* as a hash key for the given number of buckets.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1db7845d5a..44a30c2430 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -859,6 +859,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of incremental sort steps."),
+ NULL
+ },
+ &enable_incrementalsort,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 041bdc2fa7..fb17b4f1c5 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -243,6 +243,13 @@ struct Tuplesortstate
int64 allowedMem; /* total memory allowed, in bytes */
int maxTapes; /* number of tapes (Knuth's T) */
int tapeRange; /* maxTapes-1 (Knuth's P) */
+ int64 maxSpace; /* maximum amount of space occupied among sorts
+ of groups, either in-memory or on-disk */
+ bool maxSpaceOnDisk; /* true when maxSpace refers to on-disk
+ space, false when it refers to in-memory
+ space */
+ TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ MemoryContext maincontext;
MemoryContext sortcontext; /* memory context holding most sort data */
MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
@@ -647,6 +654,9 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
static void worker_nomergeruns(Tuplesortstate *state);
static void leader_takeover_tapes(Tuplesortstate *state);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state, bool delete);
+static void tuplesort_updatemax(Tuplesortstate *state);
+
/*
* Special versions of qsort just for SortTuple objects. qsort_tuple() sorts
@@ -682,6 +692,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
bool randomAccess)
{
Tuplesortstate *state;
+ MemoryContext maincontext;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
@@ -691,13 +702,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
elog(ERROR, "random access disallowed under parallel sort");
/*
- * Create a working memory context for this sort operation. All data
- * needed by the sort will live inside this context.
+ * Memory context surviving tuplesort_reset. This memory context holds
+ * data which is useful to keep while sorting multiple similar batches.
*/
- sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+ maincontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
+ /*
+ * Create a working memory context for one sort operation.  The contents
+ * of this context are released by tuplesort_reset.
+ */
+ sortcontext = AllocSetContextCreate(maincontext,
+ "TupleSort sort",
+ ALLOCSET_DEFAULT_SIZES);
+
/*
* Caller tuple (e.g. IndexTuple) memory context.
*
@@ -715,7 +734,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
- oldcontext = MemoryContextSwitchTo(sortcontext);
+ oldcontext = MemoryContextSwitchTo(maincontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
@@ -740,6 +759,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
state->availMem = state->allowedMem;
state->sortcontext = sortcontext;
state->tuplecontext = tuplecontext;
+ state->maincontext = maincontext;
state->tapeset = NULL;
state->memtupcount = 0;
@@ -807,14 +827,15 @@ tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
- int workMem, SortCoordinate coordinate, bool randomAccess)
+ int workMem, SortCoordinate coordinate,
+ bool randomAccess, bool skipAbbrev)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
randomAccess);
MemoryContext oldcontext;
int i;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
AssertArg(nkeys > 0);
@@ -857,7 +878,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
- sortKey->abbreviate = (i == 0);
+ sortKey->abbreviate = (i == 0) && !skipAbbrev;
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
@@ -890,7 +911,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -985,7 +1006,7 @@ tuplesort_begin_index_btree(Relation heapRel,
MemoryContext oldcontext;
int i;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -1064,7 +1085,7 @@ tuplesort_begin_index_hash(Relation heapRel,
randomAccess);
MemoryContext oldcontext;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -1107,7 +1128,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
int16 typlen;
bool typbyval;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -1224,16 +1245,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
}
/*
- * tuplesort_end
- *
- * Release resources and clean up.
+ * tuplesort_free
*
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage. Be careful not to attempt to use or free such
- * pointers afterwards!
+ * Internal routine for freeing resources of tuplesort.
*/
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state, bool delete)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1311,98 @@ tuplesort_end(Tuplesortstate *state)
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
- MemoryContextDelete(state->sortcontext);
+ if (delete)
+ {
+ MemoryContextDelete(state->maincontext);
+ }
+ else
+ {
+ MemoryContextResetOnly(state->sortcontext);
+ MemoryContextResetOnly(state->tuplecontext);
+ }
+}
+
+/*
+ * tuplesort_end
+ *
+ * Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage. Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+ tuplesort_free(state, true);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ * Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+ int64 spaceUsed;
+ bool spaceUsedOnDisk;
+
+ /*
+ * Note: it might seem we should provide both memory and disk usage for a
+ * disk-based sort. However, the current code doesn't track memory space
+ * accurately once we have begun to return tuples to the caller (since we
+ * don't account for pfree's the caller is expected to do), so we cannot
+ * rely on availMem in a disk sort. This does not seem worth the overhead
+ * to fix. Is it worth creating an API for the memory context code to
+ * tell us how much is actually used in sortcontext?
+ */
+ if (state->tapeset)
+ {
+ spaceUsedOnDisk = true;
+ spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+ }
+ else
+ {
+ spaceUsedOnDisk = false;
+ spaceUsed = state->allowedMem - state->availMem;
+ }
+
+ if (spaceUsed > state->maxSpace)
+ {
+ state->maxSpace = spaceUsed;
+ state->maxSpaceOnDisk = spaceUsedOnDisk;
+ state->maxSpaceStatus = state->status;
+ }
+}
+
+/*
+ * tuplesort_reset
+ *
+ * Reset the tuplesort.  Releases all the data in the tuplesort, but keeps
+ * the meta-information.  After tuplesort_reset, the tuplesort is ready to
+ * start a new sort.  This avoids recreating the tuplesort (and saves
+ * resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+ tuplesort_updatemax(state);
+ tuplesort_free(state, false);
+ state->status = TSS_INITIAL;
+ state->memtupcount = 0;
+ state->boundUsed = false;
+ state->tapeset = NULL;
+ state->currentRun = 0;
+ state->result_tape = -1;
+ state->bounded = false;
+ state->availMem = state->allowedMem;
+ state->lastReturnedTuple = NULL;
+ state->slabAllocatorUsed = false;
+ state->slabMemoryBegin = NULL;
+ state->slabMemoryEnd = NULL;
+ state->slabFreeHead = NULL;
+ USEMEM(state, GetMemoryChunkSpace(state->memtuples));
}
/*
@@ -3137,18 +3245,15 @@ tuplesort_get_stats(Tuplesortstate *state,
* to fix. Is it worth creating an API for the memory context code to
* tell us how much is actually used in sortcontext?
*/
- if (state->tapeset)
- {
+ tuplesort_updatemax(state);
+
+ if (state->maxSpaceOnDisk)
stats->spaceType = SORT_SPACE_TYPE_DISK;
- stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- }
else
- {
stats->spaceType = SORT_SPACE_TYPE_MEMORY;
- stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
- }
+ stats->spaceUsed = (state->maxSpace + 1023) / 1024;
- switch (state->status)
+ switch (state->maxSpaceStatus)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..a9b562843d
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a953820f43..bc158677b1 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1764,6 +1764,20 @@ typedef struct MaterialState
Tuplestorestate *tuplestorestate;
} MaterialState;
+
+/* ----------------
+ * When sorting by multiple keys, the input dataset may already be
+ * presorted by some prefix of those keys.  We call these "skip keys".
+ * SkipKeyData holds the information about one such key.
+ * ----------------
+ */
+typedef struct SkipKeyData
+{
+ FmgrInfo flinfo; /* comparison function info */
+ FunctionCallInfoData fcinfo; /* comparison function call info */
+ OffsetNumber attno; /* attribute number in tuple */
+} SkipKeyData;
+
/* ----------------
* Shared memory container for per-worker sort information
* ----------------
@@ -1792,6 +1806,44 @@ typedef struct SortState
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
+/* ----------------
+ * Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+ TuplesortInstrumentation sinstrument;
+ int64 groupsCount;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+ int num_workers;
+ IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ * IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+ ScanState ss; /* its first field is NodeTag */
+ bool bounded; /* is the result set bounded? */
+ int64 bound; /* if bounded, how many tuples are needed */
+ bool sort_Done; /* sort completed yet? */
+ bool finished; /* are we done fetching tuples
+ from the outer node? */
+ bool bounded_Done; /* value of bounded we did the sort with */
+ int64 bound_Done; /* value of bound we did the sort with */
+ void *tuplesortstate; /* private state of tuplesort.c */
+ SkipKeyData *skipKeys; /* keys the dataset is presorted by */
+ int64 groupsCount; /* number of groups with equal skip keys */
+ TupleTableSlot *sampleSlot; /* slot for sample tuple of sort group */
+ bool am_worker; /* are we a worker? */
+ SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
/* ---------------------
* GroupState information
* ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 74b094a9c3..133bb17bdc 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -73,6 +73,7 @@ typedef enum NodeTag
T_HashJoin,
T_Material,
T_Sort,
+ T_IncrementalSort,
T_Group,
T_Agg,
T_WindowAgg,
@@ -125,6 +126,7 @@ typedef enum NodeTag
T_HashJoinState,
T_MaterialState,
T_SortState,
+ T_IncrementalSortState,
T_GroupState,
T_AggState,
T_WindowAggState,
@@ -240,6 +242,7 @@ typedef enum NodeTag
T_ProjectionPath,
T_ProjectSetPath,
T_SortPath,
+ T_IncrementalSortPath,
T_GroupPath,
T_UpperUniquePath,
T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f2e19eae68..e29a312d4a 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -751,6 +751,17 @@ typedef struct Sort
bool *nullsFirst; /* NULLS FIRST/LAST directions */
} Sort;
+
+/* ----------------
+ * incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+ Sort sort;
+ int skipCols; /* number of presorted columns */
+} IncrementalSort;
+
/* ---------------
* group node -
* Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index d576aa7350..9d266888a4 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1519,6 +1519,16 @@ typedef struct SortPath
Path *subpath; /* path representing input source */
} SortPath;
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+ SortPath spath;
+ int skipCols;
+} IncrementalSortPath;
+
+
/*
* GroupPath represents grouping (of presorted input)
*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 132e35551b..00f0205be4 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
extern PGDLLIMPORT bool enable_hashagg;
extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
@@ -105,8 +106,9 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
- List *pathkeys, Cost input_cost, double tuples, int width,
- Cost comparison_cost, int sort_mem,
+ List *pathkeys, int presorted_keys,
+ Cost input_startup_cost, Cost input_total_cost,
+ double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_append(AppendPath *path);
extern void cost_merge_append(Path *path, PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 94f9bb2b57..8eaa1bd816 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -189,6 +189,7 @@ typedef enum
extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern int pathkeys_common(List *keys1, List *keys2);
extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
Relids required_outer,
CostSelector cost_criterion,
@@ -229,6 +230,7 @@ extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
extern List *trim_mergeclauses_for_inner_pathkeys(PlannerInfo *root,
List *mergeclauses,
List *pathkeys);
+extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
extern List *truncate_useless_pathkeys(PlannerInfo *root,
RelOptInfo *rel,
List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 299c9f846a..43e8ef20dc 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -206,6 +206,9 @@ extern void mergejoinscansel(PlannerInfo *root, Node *clause,
extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows, List **pgset);
+extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ double tuples);
+
extern void estimate_hash_bucket_stats(PlannerInfo *root,
Node *hashkey, double nbuckets,
Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index d2e6754f04..eb260dfd8b 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -193,7 +193,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
int workMem, SortCoordinate coordinate,
- bool randomAccess);
+ bool randomAccess, bool skipAbbrev);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel, int workMem,
SortCoordinate coordinate, bool randomAccess);
@@ -240,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
extern void tuplesort_end(Tuplesortstate *state);
+extern void tuplesort_reset(Tuplesortstate *state);
+
extern void tuplesort_get_stats(Tuplesortstate *state,
TuplesortInstrumentation *stats);
extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
-Sort
+Incremental Sort
Sort Key: id, data
- -> Seq Scan on test_dc
+ Presorted Key: id
+ -> Index Scan using test_dc_pkey on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index a79f891da7..0926650a0f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1517,6 +1517,7 @@ NOTICE: drop cascades to table matest1
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
SELECT thousand, tenthous FROM tenk1
@@ -1657,9 +1658,45 @@ FROM generate_series(1, 3) g(i);
{3,7,8,10,13,13,16,18,19,22}
(3 rows)
+set enable_incrementalsort = on;
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Merge Append
+ Sort Key: tenk1.thousand, tenk1.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1
+ -> Incremental Sort
+ Sort Key: tenk1_1.thousand, tenk1_1.thousand
+ Presorted Key: tenk1_1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+(7 rows)
+
+explain (costs off)
+SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+ QUERY PLAN
+-------------------------------------------------------------
+ Merge Append
+ Sort Key: a.thousand, a.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 a
+ -> Incremental Sort
+ Sort Key: b.unique2, b.unique2
+ Presorted Key: b.unique2
+ -> Index Only Scan using tenk1_unique2 on tenk1 b
+(7 rows)
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 4d5931d67e..cec3b22fb5 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2347,18 +2347,21 @@ select count(*) from
left join
(select * from tenk1 y order by y.unique2) y
on x.thousand = y.unique2 and x.twothousand = y.hundred and x.fivethous = y.unique2;
- QUERY PLAN
-----------------------------------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------------------------------------------------------
Aggregate
-> Merge Left Join
- Merge Cond: (x.thousand = y.unique2)
- Join Filter: ((x.twothousand = y.hundred) AND (x.fivethous = y.unique2))
+ Merge Cond: ((x.thousand = y.unique2) AND (x.twothousand = y.hundred) AND (x.fivethous = y.unique2))
-> Sort
Sort Key: x.thousand, x.twothousand, x.fivethous
-> Seq Scan on tenk1 x
-> Materialize
- -> Index Scan using tenk1_unique2 on tenk1 y
-(9 rows)
+ -> Incremental Sort
+ Sort Key: y.unique2, y.hundred
+ Presorted Key: y.unique2
+ -> Subquery Scan on y
+ -> Index Scan using tenk1_unique2 on tenk1 y_1
+(12 rows)
select count(*) from
(select * from tenk1 x order by x.thousand, x.twothousand, x.fivethous) x
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 4fccd9ae54..e0290977f1 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -935,10 +935,12 @@ EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b) RIGHT JOIN prt1_e t3 ON (t1.a = (t3.a + t3.b)/2) WHERE t3.c = 0 ORDER BY t1.a, t2.b, t3.a + t3.b;
QUERY PLAN
----------------------------------------------------------------------------------
- Sort
+ Incremental Sort
Sort Key: t1.a, t2.b, ((t3.a + t3.b))
+ Presorted Key: t1.a
-> Result
- -> Append
+ -> Merge Append
+ Sort Key: t1.a
-> Merge Left Join
Merge Cond: (t1.a = t2.b)
-> Sort
@@ -987,7 +989,7 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2
-> Sort
Sort Key: t2_2.b
-> Seq Scan on prt2_p3 t2_2
-(52 rows)
+(54 rows)
SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b) RIGHT JOIN prt1_e t3 ON (t1.a = (t3.a + t3.b)/2) WHERE t3.c = 0 ORDER BY t1.a, t2.b, t3.a + t3.b;
a | c | b | c | ?column? | c
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 759f7d9d59..f855214099 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_gathermerge | on
enable_hashagg | on
enable_hashjoin | on
+ enable_incrementalsort | on
enable_indexonlyscan | on
enable_indexscan | on
enable_material | on
@@ -87,7 +88,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(15 rows)
+(16 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 2e42ae115d..7229997144 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -546,6 +546,7 @@ drop table matest0 cascade;
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
@@ -607,9 +608,26 @@ SELECT
ORDER BY f.i LIMIT 10)
FROM generate_series(1, 3) g(i);
+set enable_incrementalsort = on;
+
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+
+explain (costs off)
+SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
Hi,
I have started reviewing the patch and doing some testing, and I
pretty quickly ran into a segfault. Attached is a simple reproducer and
a backtrace. AFAICS the bug seems to be somewhere in the tuplesort
changes, likely resetting a memory context too soon or something like
that. I haven't investigated it further, but it matches my hunch that
tuplesort is where the bugs are likely to be.
Otherwise the patch seems fairly complete. A couple of minor things that
I noticed while eyeballing the changes in a diff editor.
1) On a couple of places the new code has this comment
/* even when not parallel-aware */
while all the immediately preceding blocks use
/* even when not parallel-aware, for EXPLAIN ANALYZE */
I suggest using the same comment, otherwise it kinda suggests it's not
because of EXPLAIN ANALYZE.
2) I think the purpose of sampleSlot should be explicitly documented.
(And I'm not sure "sample" is a good term here, as it suggests some sort
of sampling; for comparison, nodeAgg uses grp_firstTuple.)
3) skipCols/SkipKeyData seems a bit strange too, I think. I'd use
PresortedKeyData or something like that.
4) In cmpSortSkipCols, when checking if the columns changed, the patch
does this:
n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
for (i = 0; i < n; i++)
{
... check i-th key ...
}
My hunch is that checking the keys from the last one, i.e.
for (i = (n-1); i >= 0; i--)
{
....
}
would be faster. The reasoning is that with "ORDER BY a,b" the column
"b" changes more often. But I've been unable to test this because of the
segfault crashes.
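For illustration, here is a minimal, self-contained sketch of that hunch. It uses
toy int arrays rather than the patch's actual cmpSortSkipCols() and Datum
comparisons, so all names here are illustrative only:

```c
#include <assert.h>

/* Count how many key comparisons are needed to detect a group change,
 * scanning the presorted keys front-to-back vs. back-to-front.  With
 * "ORDER BY a, b" the trailing key "b" changes most often, so the
 * backward scan tends to hit the differing key first. */
static int
compares_forward(const int *prev, const int *cur, int nkeys)
{
    int count = 0;
    int i;

    for (i = 0; i < nkeys; i++)
    {
        count++;
        if (prev[i] != cur[i])
            break;
    }
    return count;
}

static int
compares_backward(const int *prev, const int *cur, int nkeys)
{
    int count = 0;
    int i;

    for (i = nkeys - 1; i >= 0; i--)
    {
        count++;
        if (prev[i] != cur[i])
            break;
    }
    return count;
}
```

On typical presorted input, where consecutive tuples agree on the leading key
but differ on the trailing one, the backward scan stops after a single
comparison instead of two.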
5) The changes from
if (pathkeys_contained_in(...))
to
n = pathkeys_common(pathkeys, subpath->pathkeys);
if (n == 0)
seem rather inconvenient to me, as they make the code unnecessarily
verbose. I wonder if there's a better way to deal with this.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi!
Thank you for reviewing this patch!
Revised version is attached.
On Mon, Mar 5, 2018 at 1:19 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
I have started reviewing the patch and doing some testing, and I
pretty quickly ran into a segfault. Attached is a simple reproducer and
a backtrace. AFAICS the bug seems to be somewhere in the tuplesort
changes, likely resetting a memory context too soon or something like
that. I haven't investigated it further, but it matches my hunch that
tuplesort is where the bugs are likely to be.
Right. The incremental sort patch introduces a maincontext memory
context, which persists between incremental sort groups. But mergeruns()
reallocates memtuples in sortcontext, which is cleared by
tuplesort_reset(). Fixed in the revised patch.
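To illustrate the bug class, here is a toy arena standing in for PostgreSQL's
MemoryContext machinery (none of these names are from the patch): state that
must outlive a per-group reset, like the memtuples array, has to be allocated
in the persistent context, not the one that gets cleared.

```c
#include <assert.h>
#include <string.h>

/* Toy model only: two bump arenas stand in for tuplesort's persistent
 * "maincontext" and its per-group "sortcontext". */
typedef struct
{
    char   buf[256];
    size_t used;
} Arena;

static void *
arena_alloc(Arena *a, size_t size)
{
    void *p = a->buf + a->used;

    a->used += size;
    return p;
}

static void
arena_reset(Arena *a)
{
    memset(a->buf, 0, sizeof(a->buf));  /* poison released memory */
    a->used = 0;
}
```

An allocation made in the per-group arena is clobbered by the next reset,
while the same allocation in the persistent arena survives it, which is the
distinction the fix relies on.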
Otherwise the patch seems fairly complete. A couple of minor things that
I noticed while eyeballing the changes in a diff editor.
1) On a couple of places the new code has this comment
/* even when not parallel-aware */
while all the immediately preceding blocks use
/* even when not parallel-aware, for EXPLAIN ANALYZE */
I suggest using the same comment, otherwise it kinda suggests it's not
because of EXPLAIN ANALYZE.
Right, fixed. I also found that incremental sort shouldn't support
DSM reinitialization, similarly to regular sort. Fixed in the revised patch.
2) I think the purpose of sampleSlot should be explicitly documented
(and I'm not sure "sample" is a good term here, as is suggest some sort
of sampling (for example nodeAgg uses grp_firstTuple).
Yes, "sample" isn't a good term here. However, "first" isn't really
correct either, because we can skip some tuples from the beginning of the
group in order to not form groups too frequently. I'd rather name it the
"pivot" tuple slot.
3) skipCols/SkipKeyData seems a bit strange too, I think. I'd use
PresortedKeyData or something like that.
Good point, renamed.
4) In cmpSortSkipCols, when checking if the columns changed, the patch
does this:
n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
for (i = 0; i < n; i++)
{
... check i-th key ...
}
My hunch is that checking the keys from the last one, i.e.
for (i = (n-1); i >= 0; i--)
{
....
}
would be faster. The reasoning is that with "ORDER BY a,b" the column
"b" changes more often. But I've been unable to test this because of the
segfault crashes.
Agreed.
5) The changes from
if (pathkeys_contained_in(...))
to
n = pathkeys_common(pathkeys, subpath->pathkeys);
if (n == 0)
seem rather inconvenient to me, as they make the code unnecessarily
verbose. I wonder if there's a better way to deal with this.
I would rather say that it changes from
if (pathkeys_contained_in(...))
to
n = pathkeys_common(pathkeys, subpath->pathkeys);
if (n == list_length(pathkeys))
I've introduced pathkeys_common_contained_in(), which returns the same
result as pathkeys_contained_in() but sets the number of common pathkeys
through its last argument. It simplifies the code a little bit. The
name could probably be improved.
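As a hedged sketch of the intended semantics, modeled on plain int arrays
rather than PostgreSQL's List of PathKey nodes (so the names and types here
are illustrative, not the patch's actual signature):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative analog of the proposed pathkeys_common_contained_in():
 * report whether keys1 is a leading prefix of keys2, and also return
 * the number of common leading keys through n_common. */
static bool
common_contained_in(const int *keys1, int n1,
                    const int *keys2, int n2,
                    int *n_common)
{
    int n = 0;

    while (n < n1 && n < n2 && keys1[n] == keys2[n])
        n++;
    *n_common = n;
    return n == n1;        /* contained iff every key of keys1 matched */
}
```

A caller that only needs the old pathkeys_contained_in() behavior can keep
using the boolean result, while an incremental-sort caller can additionally
check n_common > 0 to see whether a useful presorted prefix exists.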
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
incremental-sort-17.patch (application/octet-stream)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index a2b13846e0..3eab376391 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1999,28 +1999,62 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
119
(10 rows)
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down. For this query, essential optimization is top-N
+-- sort. But it can't be processed at remote side, because we never do LIMIT
+-- push down. Assuming that sorting is not worth it to push down, CROSS JOIN
+-- is also not pushed down in order to transfer less tuples over network.
EXPLAIN (VERBOSE, COSTS OFF)
-SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
- QUERY PLAN
----------------------------------------------------------------------
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+ QUERY PLAN
+------------------------------------------------------------------
Limit
- Output: t1.c1, t2.c1
+ Output: t1.c3, t2.c3
-> Sort
- Output: t1.c1, t2.c1
- Sort Key: t1.c1, t2.c1
+ Output: t1.c3, t2.c3
+ Sort Key: t1.c3, t2.c3
-> Nested Loop
- Output: t1.c1, t2.c1
+ Output: t1.c3, t2.c3
-> Foreign Scan on public.ft1 t1
- Output: t1.c1
- Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+ Output: t1.c3
+ Remote SQL: SELECT c3 FROM "S 1"."T 1"
-> Materialize
- Output: t2.c1
+ Output: t2.c3
-> Foreign Scan on public.ft2 t2
- Output: t2.c1
- Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+ Output: t2.c3
+ Remote SQL: SELECT c3 FROM "S 1"."T 1"
(15 rows)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+ c3 | c3
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
+-- CROSS JOIN, pushed down. Unlike previous query, remote side is able to
+-- return tuples in given order without full sort, but using index scan and
+-- incremental sort. This is much cheaper than full sort on local side, even
+-- despite we don't know LIMIT on remote side.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+ QUERY PLAN
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+ Output: t1.c1, t2.c1
+ -> Foreign Scan
+ Output: t1.c1, t2.c1
+ Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+ Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
+
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
c1 | c1
----+-----
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 4d2e43c9f0..729086ee29 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -514,7 +514,17 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down. For this query, essential optimization is top-N
+-- sort. But it can't be processed at remote side, because we never do LIMIT
+-- push down. Assuming that sorting is not worth it to push down, CROSS JOIN
+-- is also not pushed down in order to transfer less tuples over network.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, pushed down. Unlike previous query, remote side is able to
+-- return tuples in given order without full sort, but using index scan and
+-- incremental sort. This is much cheaper than full sort on local side, even
+-- despite we don't know LIMIT on remote side.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 259a2d83b4..0bc5690ad1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3627,6 +3627,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+ <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of incremental sort
+ steps. The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
<term><varname>enable_indexscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 900fa74e85..8366a2212c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
ExplainState *es);
static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
static void show_group_keys(GroupState *gstate, List *ancestors,
ExplainState *es);
static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
- int nkeys, AttrNumber *keycols,
+ int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es);
static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
List *ancestors, ExplainState *es);
static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es);
static void show_hash_info(HashState *hashstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
@@ -1014,6 +1018,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Sort:
pname = sname = "Sort";
break;
+ case T_IncrementalSort:
+ pname = sname = "Incremental Sort";
+ break;
case T_Group:
pname = sname = "Group";
break;
@@ -1614,6 +1621,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_sort_keys(castNode(SortState, planstate), ancestors, es);
show_sort_info(castNode(SortState, planstate), es);
break;
+ case T_IncrementalSort:
+ show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ ancestors, es);
+ show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ es);
+ break;
case T_MergeAppend:
show_merge_append_keys(castNode(MergeAppendState, planstate),
ancestors, es);
@@ -1939,14 +1952,37 @@ static void
show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
{
Sort *plan = (Sort *) sortstate->ss.ps.plan;
+ int presortedCols;
+
+ if (IsA(plan, IncrementalSort))
+ presortedCols = ((IncrementalSort *) plan)->presortedCols;
+ else
+ presortedCols = 0;
show_sort_group_keys((PlanState *) sortstate, "Sort Key",
- plan->numCols, plan->sortColIdx,
+ plan->numCols, presortedCols, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
}
+/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ List *ancestors, ExplainState *es)
+{
+ IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+ show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ plan->sort.numCols, plan->presortedCols,
+ plan->sort.sortColIdx,
+ plan->sort.sortOperators, plan->sort.collations,
+ plan->sort.nullsFirst,
+ ancestors, es);
+}
+
/*
* Likewise, for a MergeAppend node.
*/
@@ -1957,7 +1993,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
show_sort_group_keys((PlanState *) mstate, "Sort Key",
- plan->numCols, plan->sortColIdx,
+ plan->numCols, 0, plan->sortColIdx,
plan->sortOperators, plan->collations,
plan->nullsFirst,
ancestors, es);
@@ -1981,7 +2017,7 @@ show_agg_keys(AggState *astate, List *ancestors,
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
- plan->numCols, plan->grpColIdx,
+ plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
@@ -2050,7 +2086,7 @@ show_grouping_set_keys(PlanState *planstate,
if (sortnode)
{
show_sort_group_keys(planstate, "Sort Key",
- sortnode->numCols, sortnode->sortColIdx,
+ sortnode->numCols, 0, sortnode->sortColIdx,
sortnode->sortOperators, sortnode->collations,
sortnode->nullsFirst,
ancestors, es);
@@ -2107,7 +2143,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(gstate, ancestors);
show_sort_group_keys(outerPlanState(gstate), "Group Key",
- plan->numCols, plan->grpColIdx,
+ plan->numCols, 0, plan->grpColIdx,
NULL, NULL, NULL,
ancestors, es);
ancestors = list_delete_first(ancestors);
@@ -2120,13 +2156,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
*/
static void
show_sort_group_keys(PlanState *planstate, const char *qlabel,
- int nkeys, AttrNumber *keycols,
+ int nkeys, int nPresortedKeys, AttrNumber *keycols,
Oid *sortOperators, Oid *collations, bool *nullsFirst,
List *ancestors, ExplainState *es)
{
Plan *plan = planstate->plan;
List *context;
List *result = NIL;
+ List *resultPresorted = NIL;
StringInfoData sortkeybuf;
bool useprefix;
int keyno;
@@ -2166,9 +2203,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
nullsFirst[keyno]);
/* Emit one property-list item per sort key */
result = lappend(result, pstrdup(sortkeybuf.data));
+ if (keyno < nPresortedKeys)
+ resultPresorted = lappend(resultPresorted, exprstr);
}
ExplainPropertyList(qlabel, result, es);
+ if (nPresortedKeys > 0)
+ ExplainPropertyList("Presorted Key", resultPresorted, es);
}
/*
@@ -2376,6 +2417,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
}
}
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for a incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ ExplainState *es)
+{
+ if (es->analyze && incrsortstate->sort_Done &&
+ incrsortstate->tuplesortstate != NULL)
+ {
+ Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ TuplesortInstrumentation stats;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+
+ tuplesort_get_stats(state, &stats);
+ sortMethod = tuplesort_method_name(stats.sortMethod);
+ spaceType = tuplesort_space_type_name(stats.spaceType);
+ spaceUsed = stats.spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",
+ sortMethod, spaceType, spaceUsed);
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str, "Sort Groups: %ld\n",
+ incrsortstate->groupsCount);
+ }
+ else
+ {
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups: %ld",
+ incrsortstate->groupsCount, es);
+ }
+ }
+
+ if (incrsortstate->shared_info != NULL)
+ {
+ int n;
+ bool opened_group = false;
+
+ for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+ {
+ TuplesortInstrumentation *sinstrument;
+ const char *sortMethod;
+ const char *spaceType;
+ long spaceUsed;
+ int64 groupsCount;
+
+ sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+ groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+ if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+ continue; /* ignore any unfilled slots */
+ sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+ spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+ spaceUsed = sinstrument->spaceUsed;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ appendStringInfoSpaces(es->str, es->indent * 2);
+ appendStringInfo(es->str,
+ "Worker %d: Sort Method: %s %s: %ldkB Groups: %ld\n",
+ n, sortMethod, spaceType, spaceUsed, groupsCount);
+ }
+ else
+ {
+ if (!opened_group)
+ {
+ ExplainOpenGroup("Workers", "Workers", false, es);
+ opened_group = true;
+ }
+ ExplainOpenGroup("Worker", NULL, true, es);
+ ExplainPropertyInteger("Worker Number", n, es);
+ ExplainPropertyText("Sort Method", sortMethod, es);
+ ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyLong("Sort Groups", groupsCount, es);
+ ExplainCloseGroup("Worker", NULL, true, es);
+ }
+ }
+ if (opened_group)
+ ExplainCloseGroup("Workers", "Workers", false, es);
+ }
+}
+
/*
* Show information on hash buckets/batches.
*/
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
- nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
- nodeValuesscan.o \
+ nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+ nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..34e05330ea 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
ExecReScanSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecReScanIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecReScanGroup((GroupState *) node);
break;
@@ -525,8 +530,12 @@ ExecSupportsBackwardScan(Plan *node)
case T_CteScan:
case T_Material:
case T_Sort:
+ /* these don't evaluate tlist */
return true;
+ case T_IncrementalSort:
+ return false;
+
case T_LockRows:
case T_Limit:
return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 14b0b89463..6c597c5b20 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
#include "executor/nodeForeignscan.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeSeqscan.h"
@@ -280,6 +281,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecSortEstimate((SortState *) planstate, e->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware, for EXPLAIN ANALYZE */
+ ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+ break;
default:
break;
@@ -493,6 +498,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware, for EXPLAIN ANALYZE */
+ ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+ break;
default:
break;
@@ -916,6 +925,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
break;
case T_HashState:
case T_SortState:
+ case T_IncrementalSortState:
/* these nodes have DSM state, but no reinitialization is required */
break;
@@ -976,6 +986,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
case T_SortState:
ExecSortRetrieveInstrumentation((SortState *) planstate);
break;
+ case T_IncrementalSortState:
+ ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+ break;
case T_HashState:
ExecHashRetrieveInstrumentation((HashState *) planstate);
break;
@@ -1225,6 +1238,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecSortInitializeWorker((SortState *) planstate, pwcxt);
break;
+ case T_IncrementalSortState:
+ /* even when not parallel-aware, for EXPLAIN ANALYZE */
+ ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+ pwcxt);
+ break;
default:
break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 43a27a9af2..17163448a3 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
estate, eflags);
break;
+ case T_IncrementalSort:
+ result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+ estate, eflags);
+ break;
+
case T_Group:
result = (PlanState *) ExecInitGroup((Group *) node,
estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
ExecEndSort((SortState *) node);
break;
+ case T_IncrementalSortState:
+ ExecEndIncrementalSort((IncrementalSortState *) node);
+ break;
+
case T_GroupState:
ExecEndGroup((GroupState *) node);
break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 1b1334006f..77013909a8 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -373,7 +373,7 @@ initialize_phase(AggState *aggstate, int newphase)
sortnode->collations,
sortnode->nullsFirst,
work_mem,
- NULL, false);
+ NULL, false, false);
}
aggstate->current_phase = newphase;
@@ -460,7 +460,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
- work_mem, NULL, false);
+ work_mem, NULL, false, false);
}
/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..1f5e41f95a
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,631 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ * Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ * Incremental sort is a specially optimized kind of multikey sort used
+ * when the input is already presorted by a prefix of the required keys
+ * list. Thus, when it's required to sort by (key1, key2 ... keyN) and
+ * the input is already sorted by (key1, key2 ... keyM), M < N, we only
+ * sort groups in which the values of (key1, key2 ... keyM) are equal.
+ *
+ * Consider the following example. We have input tuples consisting of
+ * two integers (x, y), already presorted by x, while it's required to
+ * sort them by x and y. Let the input tuples be the following.
+ *
+ * (1, 5)
+ * (1, 2)
+ * (2, 10)
+ * (2, 1)
+ * (2, 5)
+ * (3, 3)
+ * (3, 7)
+ *
+ * The incremental sort algorithm would individually sort by y the
+ * following groups, which have equal x:
+ * (1, 5) (1, 2)
+ * (2, 10) (2, 1) (2, 5)
+ * (3, 3) (3, 7)
+ *
+ * After sorting these groups and putting them together, we get the
+ * following tuple set, which is sorted by x and y.
+ *
+ * (1, 2)
+ * (1, 5)
+ * (2, 1)
+ * (2, 5)
+ * (2, 10)
+ * (3, 3)
+ * (3, 7)
+ *
+ * Incremental sort is faster than a full sort on large datasets. But
+ * the biggest benefit of incremental sort is in queries with LIMIT,
+ * because incremental sort can return the first tuples without reading
+ * the whole input dataset.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presortedKeys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ int presortedCols,
+ i;
+
+ Assert(IsA(plannode, IncrementalSort));
+ presortedCols = plannode->presortedCols;
+
+ node->presortedKeys = (PresortedKeyData *) palloc(presortedCols *
+ sizeof(PresortedKeyData));
+
+ for (i = 0; i < presortedCols; i++)
+ {
+ Oid equalityOp,
+ equalityFunc;
+ PresortedKeyData *key;
+
+ key = &node->presortedKeys[i];
+ key->attno = plannode->sort.sortColIdx[i];
+
+ equalityOp = get_equality_op_for_ordering_op(
+ plannode->sort.sortOperators[i], NULL);
+ if (!OidIsValid(equalityOp))
+ elog(ERROR, "missing equality operator for ordering operator %u",
+ plannode->sort.sortOperators[i]);
+
+ equalityFunc = get_opcode(equalityOp);
+ if (!OidIsValid(equalityFunc))
+ elog(ERROR, "missing function for operator %u", equalityOp);
+
+ /* Lookup the comparison function */
+ fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+ /* We can initialize the callinfo just once and re-use it */
+ InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ plannode->sort.collations[i], NULL, NULL);
+ key->fcinfo.argnull[0] = false;
+ key->fcinfo.argnull[1] = false;
+ }
+}
+
+/*
+ * Check whether the first "presortedCols" sort column values are equal.
+ */
+static bool
+cmpSortPresortedCols(IncrementalSortState *node, TupleTableSlot *a,
+ TupleTableSlot *b)
+{
+ int n, i;
+
+ Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+ n = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+ for (i = n - 1; i >= 0; i--)
+ {
+ Datum datumA,
+ datumB,
+ result;
+ bool isnullA,
+ isnullB;
+ AttrNumber attno = node->presortedKeys[i].attno;
+ PresortedKeyData *key;
+
+ datumA = slot_getattr(a, attno, &isnullA);
+ datumB = slot_getattr(b, attno, &isnullB);
+
+ /* Special case for NULL-vs-NULL, else use standard comparison */
+ if (isnullA || isnullB)
+ {
+ if (isnullA == isnullB)
+ continue;
+ else
+ return false;
+ }
+
+ key = &node->presortedKeys[i];
+
+ key->fcinfo.arg[0] = datumA;
+ key->fcinfo.arg[1] = datumB;
+
+ /* just for paranoia's sake, we reset isnull each time */
+ key->fcinfo.isnull = false;
+
+ result = FunctionCallInvoke(&key->fcinfo);
+
+ /* Check for null result, since caller is clearly not expecting one */
+ if (key->fcinfo.isnull)
+ elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+ if (!DatumGetBool(result))
+ return false;
+ }
+ return true;
+}
+
+/*
+ * Copying tuples to node->grpPivotSlot introduces some overhead. It's
+ * especially noticeable when groups contain only one or a few tuples. To
+ * cope with this problem, we don't copy the pivot tuple until the group
+ * contains at least MIN_GROUP_SIZE tuples. This might reduce the efficiency
+ * of incremental sort, but it reduces the probability of regression.
+ */
+#define MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSort
+ *
+ * Assuming that the outer subtree returns tuples presorted by some
+ * prefix of the target sort columns, perform an incremental sort. It
+ * fetches groups of tuples in which the prefix sort columns are equal
+ * and sorts them using tuplesort. This approach avoids sorting the
+ * whole dataset. Besides taking less memory and being faster, it
+ * allows returning tuples before the full dataset is fetched from
+ * the outer subtree.
+ *
+ * Conditions:
+ * -- none.
+ *
+ * Initial States:
+ * -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+ IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+ EState *estate;
+ ScanDirection dir;
+ Tuplesortstate *tuplesortstate;
+ TupleTableSlot *slot;
+ IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+ PlanState *outerNode;
+ TupleDesc tupDesc;
+ int64 nTuples = 0;
+
+ /*
+ * get state info from node
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "entering routine");
+
+ estate = node->ss.ps.state;
+ dir = estate->es_direction;
+ tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+ /*
+ * Return next tuple from sorted set if any.
+ */
+ if (node->sort_Done)
+ {
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ if (tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL) || node->finished)
+ return slot;
+ }
+
+ /*
+ * If the current group is exhausted, read the next group of tuples from
+ * the outer plan and pass them to tuplesort.c.
+ */
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "sorting subplan");
+
+ /*
+ * Want to scan subplan in the forward direction while creating the
+ * sorted data.
+ */
+ estate->es_direction = ForwardScanDirection;
+
+ /*
+ * Initialize tuplesort module.
+ */
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "calling tuplesort_begin");
+
+ outerNode = outerPlanState(node);
+ tupDesc = ExecGetResultType(outerNode);
+
+ if (node->tuplesortstate == NULL)
+ {
+ /*
+ * We are going to process the first group of presorted data.
+ * Initialize support structures for cmpSortPresortedCols, which
+ * compares the already-sorted columns.
+ */
+ preparePresortedCols(node);
+
+ /*
+ * Pass all the columns to tuplesort. We pass groups of at least
+ * MIN_GROUP_SIZE tuples to tuplesort, so these groups don't
+ * necessarily have equal values of the first column. We are unlikely
+ * to have huge groups with incremental sort, so using abbreviated
+ * keys would likely be a waste of time.
+ */
+ tuplesortstate = tuplesort_begin_heap(
+ tupDesc,
+ plannode->sort.numCols,
+ plannode->sort.sortColIdx,
+ plannode->sort.sortOperators,
+ plannode->sort.collations,
+ plannode->sort.nullsFirst,
+ work_mem,
+ NULL,
+ false,
+ true);
+ node->tuplesortstate = (void *) tuplesortstate;
+ }
+ else
+ {
+ /* Next group of presorted data */
+ tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ }
+ node->groupsCount++;
+
+ /* Calculate remaining bound for bounded sort */
+ if (node->bounded)
+ tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+ /* Put saved tuple to tuplesort if any */
+ if (!TupIsNull(node->grpPivotSlot))
+ {
+ tuplesort_puttupleslot(tuplesortstate, node->grpPivotSlot);
+ ExecClearTuple(node->grpPivotSlot);
+ nTuples++;
+ }
+
+ /*
+ * Put the next group of tuples, in which the presortedCols sort values
+ * are equal, into the tuplesort.
+ */
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ {
+ node->finished = true;
+ break;
+ }
+
+ /* Put next group of presorted data to the tuplesort */
+ if (nTuples < MIN_GROUP_SIZE)
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+
+ /* Save last tuple in minimal group */
+ if (nTuples == MIN_GROUP_SIZE - 1)
+ ExecCopySlot(node->grpPivotSlot, slot);
+ nTuples++;
+ }
+ else
+ {
+ /* Iterate while presorted cols are the same as in saved tuple */
+ if (cmpSortPresortedCols(node, node->grpPivotSlot, slot))
+ {
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ nTuples++;
+ }
+ else
+ {
+ ExecCopySlot(node->grpPivotSlot, slot);
+ break;
+ }
+ }
+ }
+
+ /*
+ * Complete the sort.
+ */
+ tuplesort_performsort(tuplesortstate);
+
+ /*
+ * restore to user specified direction
+ */
+ estate->es_direction = dir;
+
+ /*
+ * finally set the sorted flag to true
+ */
+ node->sort_Done = true;
+ node->bounded_Done = node->bounded;
+ if (node->shared_info && node->am_worker)
+ {
+ TuplesortInstrumentation *si;
+
+ Assert(IsParallelWorker());
+ Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+ si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+ tuplesort_get_stats(tuplesortstate, si);
+ node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+ node->groupsCount;
+ }
+
+ /*
+ * Adjust bound_Done with number of tuples we've actually sorted.
+ */
+ if (node->bounded)
+ {
+ if (node->finished)
+ node->bound_Done = node->bound;
+ else
+ node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ }
+
+ SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+ SO1_printf("ExecIncrementalSort: %s\n",
+ "retrieving tuple from tuplesort");
+
+ /*
+ * Get the first or next tuple from tuplesort. Returns NULL if no more
+ * tuples.
+ */
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+ return slot;
+}
+
+/* ----------------------------------------------------------------
+ * ExecInitIncrementalSort
+ *
+ * Creates the run-time state information for the sort node
+ * produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+ IncrementalSortState *incrsortstate;
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "initializing sort node");
+
+ /*
+ * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+ * current group in tuplesortstate.
+ */
+ Assert((eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) == 0);
+
+ /*
+ * create state structure
+ */
+ incrsortstate = makeNode(IncrementalSortState);
+ incrsortstate->ss.ps.plan = (Plan *) node;
+ incrsortstate->ss.ps.state = estate;
+ incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+ incrsortstate->bounded = false;
+ incrsortstate->sort_Done = false;
+ incrsortstate->finished = false;
+ incrsortstate->tuplesortstate = NULL;
+ incrsortstate->grpPivotSlot = NULL;
+ incrsortstate->bound_Done = 0;
+ incrsortstate->groupsCount = 0;
+ incrsortstate->presortedKeys = NULL;
+
+ /*
+ * Miscellaneous initialization
+ *
+ * Sort nodes don't initialize their ExprContexts because they never call
+ * ExecQual or ExecProject.
+ */
+
+ /*
+ * initialize child nodes
+ *
+ * We shield the child node from the need to support REWIND, BACKWARD, or
+ * MARK/RESTORE.
+ */
+ eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+ outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+ /*
+ * Initialize scan slot and type.
+ */
+ ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss);
+
+ /*
+ * Initialize return slot and type. No need to initialize projection info because
+ * this node doesn't do projections.
+ */
+ ExecInitResultTupleSlotTL(estate, &incrsortstate->ss.ps);
+ incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+ /* make standalone slot to store previous tuple from outer node */
+ incrsortstate->grpPivotSlot = MakeSingleTupleTableSlot(
+ ExecGetResultType(outerPlanState(incrsortstate)));
+
+ SO1_printf("ExecInitIncrementalSort: %s\n",
+ "sort node initialized");
+
+ return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ * ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "shutting down sort node");
+
+ /*
+ * clean out the tuple table
+ */
+ ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ /* must drop standalone tuple slot from outer node */
+ ExecDropSingleTupleTableSlot(node->grpPivotSlot);
+
+ /*
+ * Release tuplesort resources
+ */
+ if (node->tuplesortstate != NULL)
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+
+ /*
+ * shut down the subplan
+ */
+ ExecEndNode(outerPlanState(node));
+
+ SO1_printf("ExecEndIncrementalSort: %s\n",
+ "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+ PlanState *outerPlan = outerPlanState(node);
+
+ /*
+ * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ * re-scan it at all.
+ */
+ if (!node->sort_Done)
+ return;
+
+ /* must drop pointer to sort result tuple */
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+ /*
+ * If subnode is to be rescanned then we forget previous sort results;
+ * we have to re-read the subplan and re-sort. Incremental sort keeps
+ * only the current group in the tuplesort, so unlike plain Sort we
+ * can never just rewind and rescan the sorted output; re-sorting is
+ * always required.
+ */
+ node->sort_Done = false;
+ tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ node->tuplesortstate = NULL;
+ node->bound_Done = 0;
+
+ /*
+ * if chgParam of subnode is not null then plan will be re-scanned by
+ * first ExecProcNode.
+ */
+ if (outerPlan->chgParam == NULL)
+ ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ * Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortEstimate
+ *
+ * Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+ size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+ shm_toc_estimate_chunk(&pcxt->estimator, size);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortInitializeDSM
+ *
+ * Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+ Size size;
+
+ /* don't need this if not instrumenting or no workers */
+ if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + pcxt->nworkers * sizeof(IncrementalSortInfo);
+ node->shared_info = shm_toc_allocate(pcxt->toc, size);
+ /* ensure any unfilled slots will contain zeroes */
+ memset(node->shared_info, 0, size);
+ node->shared_info->num_workers = pcxt->nworkers;
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+ node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortInitializeWorker
+ *
+ * Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+ node->shared_info =
+ shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+ node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ * ExecIncrementalSortRetrieveInstrumentation
+ *
+ * Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+ Size size;
+ SharedIncrementalSortInfo *si;
+
+ if (node->shared_info == NULL)
+ return;
+
+ size = offsetof(SharedIncrementalSortInfo, sinfo)
+ + node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+ si = palloc(size);
+ memcpy(si, node->shared_info, size);
+ node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73f16c9aba..457e774b3d 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,9 @@ ExecSort(PlanState *pstate)
plannode->collations,
plannode->nullsFirst,
work_mem,
- NULL, node->randomAccess);
+ NULL,
+ node->randomAccess,
+ false);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 266a3ef8ef..a17a24b62b 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -920,6 +920,24 @@ _copyMaterial(const Material *from)
}
+/*
+ * CopySortFields
+ *
+ * This function copies the fields of the Sort node. It is used by
+ * all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+ CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+ COPY_SCALAR_FIELD(numCols);
+ COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
/*
* _copySort
*/
@@ -931,13 +949,29 @@ _copySort(const Sort *from)
/*
* copy node superclass fields
*/
- CopyPlanFields((const Plan *) from, (Plan *) newnode);
+ CopySortFields(from, newnode);
- COPY_SCALAR_FIELD(numCols);
- COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
- COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
- COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
- COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+ IncrementalSort *newnode = makeNode(IncrementalSort);
+
+ /*
+ * copy node superclass fields
+ */
+ CopySortFields((const Sort *) from, (Sort *) newnode);
+
+ /*
+ * copy remainder of node
+ */
+ COPY_SCALAR_FIELD(presortedCols);
return newnode;
}
@@ -4831,6 +4865,9 @@ copyObjectImpl(const void *from)
case T_Sort:
retval = _copySort(from);
break;
+ case T_IncrementalSort:
+ retval = _copyIncrementalSort(from);
+ break;
case T_Group:
retval = _copyGroup(from);
break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 011d2a3fa9..6666dd0a82 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -876,12 +876,10 @@ _outMaterial(StringInfo str, const Material *node)
}
static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
{
int i;
- WRITE_NODE_TYPE("SORT");
-
_outPlanInfo(str, (const Plan *) node);
WRITE_INT_FIELD(numCols);
@@ -903,6 +901,24 @@ _outSort(StringInfo str, const Sort *node)
appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
}
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+ WRITE_NODE_TYPE("SORT");
+
+ _outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+ WRITE_NODE_TYPE("INCREMENTALSORT");
+
+ _outSortInfo(str, (const Sort *) node);
+
+ WRITE_INT_FIELD(presortedCols);
+}
+
static void
_outUnique(StringInfo str, const Unique *node)
{
@@ -3754,6 +3770,9 @@ outNode(StringInfo str, const void *obj)
case T_Sort:
_outSort(str, obj);
break;
+ case T_IncrementalSort:
+ _outIncrementalSort(str, obj);
+ break;
case T_Unique:
_outUnique(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 068db353d7..c50365c56a 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2066,12 +2066,13 @@ _readMaterial(void)
}
/*
- * _readSort
+ * ReadCommonSort
+ * Read the fields common to all nodes that inherit from Sort
*/
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
{
- READ_LOCALS(Sort);
+ READ_TEMP_LOCALS();
ReadCommonPlan(&local_node->plan);
@@ -2080,6 +2081,32 @@ _readSort(void)
READ_OID_ARRAY(sortOperators, local_node->numCols);
READ_OID_ARRAY(collations, local_node->numCols);
READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+ READ_LOCALS_NO_FIELDS(Sort);
+
+ ReadCommonSort(local_node);
+
+ READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+ READ_LOCALS(IncrementalSort);
+
+ ReadCommonSort(&local_node->sort);
+
+ READ_INT_FIELD(presortedCols);
READ_DONE();
}
@@ -2647,6 +2674,8 @@ parseNodeString(void)
return_value = _readMaterial();
else if (MATCH("SORT", 4))
return_value = _readSort();
+ else if (MATCH("INCREMENTALSORT", 15))
+ return_value = _readIncrementalSort();
else if (MATCH("GROUP", 5))
return_value = _readGroup();
else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 1c792a00eb..c546dc8862 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3624,6 +3624,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
ptype = "Sort";
subpath = ((SortPath *) path)->subpath;
break;
+ case T_IncrementalSortPath:
+ ptype = "IncrementalSort";
+ subpath = ((SortPath *) path)->subpath;
+ break;
case T_GroupPath:
ptype = "Group";
subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index d8db0b29e1..730e69f313 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool enable_indexonlyscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
+bool enable_incrementalsort = true;
bool enable_hashagg = true;
bool enable_nestloop = true;
bool enable_material = true;
@@ -1614,6 +1615,13 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
+ * The sort can be either a full sort of the relation or an incremental
+ * sort, used when the data is already presorted by some of the required
+ * pathkeys. In the latter case we estimate the number of groups the source
+ * data is divided into by the presorted pathkeys, then estimate the cost of
+ * sorting each individual group, assuming the data is divided uniformly.
+ * Also, if LIMIT is specified, we only have to pull and sort some groups.
+ *
* If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
@@ -1640,7 +1648,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
* work that has to be done to prepare the inputs to the comparison operators.
*
* 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
+ * 'presorted_keys' is the number of pathkeys already presorted in the given path
+ * 'input_startup_cost' is the startup cost for reading the input data
+ * 'input_total_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
* 'comparison_cost' is the extra cost per comparison, if any
@@ -1656,19 +1666,28 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
*/
void
cost_sort(Path *path, PlannerInfo *root,
- List *pathkeys, Cost input_cost, double tuples, int width,
- Cost comparison_cost, int sort_mem,
+ List *pathkeys, int presorted_keys,
+ Cost input_startup_cost, Cost input_total_cost,
+ double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples)
{
- Cost startup_cost = input_cost;
- Cost run_cost = 0;
+ Cost startup_cost = input_startup_cost;
+ Cost run_cost = 0,
+ rest_cost,
+ group_cost,
+ input_run_cost = input_total_cost - input_startup_cost;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
+ double num_groups,
+ group_input_bytes,
+ group_tuples;
long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
+ if (!enable_incrementalsort)
+ presorted_keys = 0;
path->rows = tuples;
@@ -1694,13 +1713,50 @@ cost_sort(Path *path, PlannerInfo *root,
output_bytes = input_bytes;
}
- if (output_bytes > sort_mem_bytes)
+ /*
+ * Estimate the number of groups the dataset is divided into by the presorted keys.
+ */
+ if (presorted_keys > 0)
+ {
+ List *presortedExprs = NIL;
+ ListCell *l;
+ int i = 0;
+
+ /* Extract presorted keys as list of expressions */
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ presortedExprs = lappend(presortedExprs, member->em_expr);
+
+ i++;
+ if (i >= presorted_keys)
+ break;
+ }
+
+ /* Estimate number of groups with equal presorted keys */
+ num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
+ }
+ else
+ {
+ num_groups = 1.0;
+ }
+
+ /*
+ * Estimate the average cost of sorting one group in which the presorted
+ * keys are equal.
+ */
+ group_input_bytes = input_bytes / num_groups;
+ group_tuples = tuples / num_groups;
+ if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
- double npages = ceil(input_bytes / BLCKSZ);
- double nruns = input_bytes / sort_mem_bytes;
+ double npages = ceil(group_input_bytes / BLCKSZ);
+ double nruns = group_input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
@@ -1710,7 +1766,7 @@ cost_sort(Path *path, PlannerInfo *root,
*
* Assume about N log2 N comparisons
*/
- startup_cost += comparison_cost * tuples * LOG2(tuples);
+ group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
/* Disk costs */
@@ -1721,10 +1777,10 @@ cost_sort(Path *path, PlannerInfo *root,
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
- startup_cost += npageaccesses *
+ group_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
- else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
+ else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
@@ -1732,14 +1788,33 @@ cost_sort(Path *path, PlannerInfo *root,
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
- startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+ group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
}
else
{
- /* We'll use plain quicksort on all the input tuples */
- startup_cost += comparison_cost * tuples * LOG2(tuples);
+ /*
+ * We'll use plain quicksort on all the input tuples. If we expect
+ * fewer than two tuples per sort group, assume the logarithmic part
+ * of the estimate to be 1.
+ */
+ if (group_tuples >= 2.0)
+ group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
+ else
+ group_cost = comparison_cost * group_tuples;
}
+ /* Add the per-group cost of fetching tuples from the input */
+ group_cost += input_run_cost / num_groups;
+
+ /*
+ * We have to sort the first group before the node can start returning
+ * tuples. Sorting the rest of the groups is required to return all the
+ * other tuples.
+ */
+ startup_cost += group_cost;
+ rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ if (rest_cost > 0.0)
+ run_cost += rest_cost;
+
/*
* Also charge a small amount (arbitrarily set equal to operator cost) per
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
@@ -1750,6 +1825,20 @@ cost_sort(Path *path, PlannerInfo *root,
*/
run_cost += cpu_operator_cost * tuples;
+ /* Extra costs of incremental sort */
+ if (presorted_keys > 0)
+ {
+ /*
+ * In the incremental sort case we also have to cost the detection of
+ * sort groups. This turns out to be one extra copy and comparison
+ * per tuple.
+ */
+ run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+ /* Cost of the per-group tuplesort reset */
+ run_cost += 2.0 * cpu_tuple_cost * num_groups;
+ }
+
path->startup_cost = startup_cost;
path->total_cost = startup_cost + run_cost;
}
@@ -2727,6 +2816,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
cost_sort(&sort_path,
root,
outersortkeys,
+ pathkeys_common(outer_path->pathkeys, outersortkeys),
+ outer_path->startup_cost,
outer_path->total_cost,
outer_path_rows,
outer_path->pathtarget->width,
@@ -2753,6 +2844,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
cost_sort(&sort_path,
root,
innersortkeys,
+ pathkeys_common(inner_path->pathkeys, innersortkeys),
+ inner_path->startup_cost,
inner_path->total_cost,
inner_path_rows,
inner_path->pathtarget->width,
@@ -2989,18 +3082,17 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
* inner path is to be used directly (without sorting) and it doesn't
* support mark/restore.
*
- * Since the inner side must be ordered, and only Sorts and IndexScans can
- * create order to begin with, and they both support mark/restore, you
- * might think there's no problem --- but you'd be wrong. Nestloop and
- * merge joins can *preserve* the order of their inputs, so they can be
- * selected as the input of a mergejoin, and they don't support
- * mark/restore at present.
+ * Sorts and IndexScans support mark/restore, but IncrementalSorts don't.
+ * Also Nestloop and merge joins can *preserve* the order of their inputs,
+ * so they can be selected as the input of a mergejoin, and they don't
+ * support mark/restore at present.
*
* We don't test the value of enable_material here, because
* materialization is required for correctness in this case, and turning
* it off does not entitle us to deliver an invalid plan.
*/
- else if (innersortkeys == NIL &&
+ else if ((innersortkeys == NIL ||
+ pathkeys_common(innersortkeys, inner_path->pathkeys) > 0) &&
!ExecSupportsMarkRestore(inner_path))
path->materialize_inner = true;
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 6d1cc3b8a0..869c7c0b16 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -22,10 +22,12 @@
#include "nodes/nodeFuncs.h"
#include "nodes/plannodes.h"
#include "optimizer/clauses.h"
+#include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/tlist.h"
#include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -308,6 +310,7 @@ compare_pathkeys(List *keys1, List *keys2)
return PATHKEYS_EQUAL;
}
+
/*
* pathkeys_contained_in
* Common special case of compare_pathkeys: we just want to know
@@ -327,6 +330,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
return false;
}
+
+/*
+ * pathkeys_common_contained_in
+ * Same as pathkeys_contained_in, but also sets *n_common to the length
+ * of the longest common prefix of keys1 and keys2.
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+ int n = 0;
+ ListCell *key1,
+ *key2;
+
+ forboth(key1, keys1, key2, keys2)
+ {
+ PathKey *pathkey1 = (PathKey *) lfirst(key1);
+ PathKey *pathkey2 = (PathKey *) lfirst(key2);
+
+ if (pathkey1 != pathkey2)
+ {
+ *n_common = n;
+ return false;
+ }
+ n++;
+ }
+
+ *n_common = n;
+ return (key1 == NULL);
+}
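The prefix logic can be exercised standalone; this hypothetical analogue compares integer keys instead of canonical `PathKey` pointers:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Analogue of pathkeys_common_contained_in() over int arrays: set *n_common
 * to the longest common prefix length and return true iff keys1 is fully
 * contained in (i.e. is a prefix of) keys2.
 */
static bool
common_contained_in(const int *keys1, int len1,
					const int *keys2, int len2, int *n_common)
{
	int			n = 0;

	while (n < len1 && n < len2)
	{
		if (keys1[n] != keys2[n])
		{
			*n_common = n;
			return false;
		}
		n++;
	}
	*n_common = n;
	/* contained iff we consumed all of keys1 */
	return n == len1;
}
```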
+
+
+/*
+ * pathkeys_common
+ * Return the length of the longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+ int n;
+
+ (void) pathkeys_common_contained_in(keys1, keys2, &n);
+ return n;
+}
+
+
/*
* get_cheapest_path_for_pathkeys
* Find the cheapest path (according to the specified criterion) that
@@ -1580,26 +1628,45 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
* Count the number of pathkeys that are useful for meeting the
* query's requested output ordering.
*
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Returns the number of pathkeys that match the given query_pathkeys.
+ * The remaining pathkeys can be satisfied by incremental sort.
*/
-static int
-pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
+int
+pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
{
- if (root->query_pathkeys == NIL)
+ int n_common_pathkeys;
+
+ if (query_pathkeys == NIL)
return 0; /* no special ordering requested */
if (pathkeys == NIL)
return 0; /* unordered path */
- if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
+ if (pathkeys_common_contained_in(query_pathkeys, pathkeys, &n_common_pathkeys))
{
- /* It's useful ... or at least the first N keys are */
- return list_length(root->query_pathkeys);
+ /* Full match of pathkeys: always useful */
+ return n_common_pathkeys;
+ }
+ else
+ {
+ if (enable_incrementalsort)
+ {
+ /*
+ * Return the number of path keys in common, or 0 if there are none.
+ * Any leading common pathkeys could be useful for ordering because
+ * we can use the incremental sort.
+ */
+ return n_common_pathkeys;
+ }
+ else
+ {
+ /*
+ * When incremental sort is disabled, pathkeys are useful only when
+ * they contain all the query pathkeys.
+ */
+ return 0;
+ }
}
-
- return 0; /* path ordering not useful */
}
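Under the assumptions stated in the comments, the decision above reduces to a small helper (hypothetical names; the boolean parameter stands in for the `enable_incrementalsort` GUC):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy version of pathkeys_useful_for_ordering(): with incremental sort
 * enabled, any leading common prefix is useful; otherwise only a full
 * match of the query pathkeys counts.
 */
static int
useful_for_ordering(int n_query_keys, int n_common, bool full_match,
					bool enable_incrementalsort)
{
	if (n_query_keys == 0)
		return 0;				/* no special ordering requested */
	if (full_match)
		return n_common;		/* all query pathkeys satisfied */
	return enable_incrementalsort ? n_common : 0;
}
```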
/*
@@ -1615,7 +1682,7 @@ truncate_useless_pathkeys(PlannerInfo *root,
int nuseful2;
nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
- nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
+ nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
if (nuseful2 > nuseful)
nuseful = nuseful2;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 9ae1bf31d5..30b91bd5bc 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -236,7 +236,7 @@ static MergeJoin *make_mergejoin(List *tlist,
Plan *lefttree, Plan *righttree,
JoinType jointype, bool inner_unique,
bool skip_mark_restore);
-static Sort *make_sort(Plan *lefttree, int numCols,
+static Sort *make_sort(Plan *lefttree, int numCols, int presortedCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst);
static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
@@ -252,10 +252,11 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
TargetEntry *tle,
Relids relids);
static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
- Relids relids);
+ Relids relids, int presortedCols);
static Sort *make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
- Plan *lefttree);
+ Plan *lefttree,
+ int presortedCols);
static Material *make_material(Plan *lefttree);
static WindowAgg *make_windowagg(List *tlist, Index winref,
int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
@@ -443,6 +444,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
(GatherPath *) best_path);
break;
case T_Sort:
+ case T_IncrementalSort:
plan = (Plan *) create_sort_plan(root,
(SortPath *) best_path,
flags);
@@ -1128,6 +1130,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int n_common_pathkeys;
/* Build the child plan */
/* Must insist that all children return the same tlist */
@@ -1162,9 +1165,11 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
- if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ if (!pathkeys_common_contained_in(pathkeys, subpath->pathkeys,
+ &n_common_pathkeys))
{
Sort *sort = make_sort(subplan, numsortkeys,
+ n_common_pathkeys,
sortColIdx, sortOperators,
collations, nullsFirst);
@@ -1514,6 +1519,7 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
Plan *subplan;
List *pathkeys = best_path->path.pathkeys;
List *tlist = build_path_tlist(root, &best_path->path);
+ int n_common_pathkeys;
/* As with Gather, it's best to project away columns in the workers. */
subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
@@ -1543,12 +1549,16 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
- if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
+ if (!pathkeys_common_contained_in(pathkeys, best_path->subpath->pathkeys,
+ &n_common_pathkeys))
+ {
subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ n_common_pathkeys,
gm_plan->sortColIdx,
gm_plan->sortOperators,
gm_plan->collations,
gm_plan->nullsFirst);
+ }
/* Now insert the subplan under GatherMerge. */
gm_plan->plan.lefttree = subplan;
@@ -1661,6 +1671,7 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
{
Sort *plan;
Plan *subplan;
+ int n_common_pathkeys;
/*
* We don't want any excess columns in the sorted tuples, so request a
@@ -1670,7 +1681,13 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST);
- plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+ if (IsA(best_path, IncrementalSortPath))
+ n_common_pathkeys = ((IncrementalSortPath *) best_path)->presortedCols;
+ else
+ n_common_pathkeys = 0;
+
+ plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+ NULL, n_common_pathkeys);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -1914,7 +1931,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
new_grpColIdx,
- subplan);
+ subplan,
+ 0);
}
if (!rollup->is_hashed)
@@ -3862,10 +3880,15 @@ create_mergejoin_plan(PlannerInfo *root,
*/
if (best_path->outersortkeys)
{
+ Sort *sort;
+ int n_common_pathkeys;
Relids outer_relids = outer_path->parent->relids;
- Sort *sort = make_sort_from_pathkeys(outer_plan,
- best_path->outersortkeys,
- outer_relids);
+
+ n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
+ best_path->jpath.outerjoinpath->pathkeys);
+
+ sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
+ outer_relids, n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
outer_plan = (Plan *) sort;
@@ -3876,10 +3899,15 @@ create_mergejoin_plan(PlannerInfo *root,
if (best_path->innersortkeys)
{
+ Sort *sort;
+ int n_common_pathkeys;
Relids inner_relids = inner_path->parent->relids;
- Sort *sort = make_sort_from_pathkeys(inner_plan,
- best_path->innersortkeys,
- inner_relids);
+
+ n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
+ best_path->jpath.innerjoinpath->pathkeys);
+
+ sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
+ inner_relids, n_common_pathkeys);
label_sort_with_costsize(root, sort, -1.0);
inner_plan = (Plan *) sort;
@@ -4934,8 +4962,13 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
{
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
+ int presorted_cols = 0;
+
+ if (IsA(plan, IncrementalSort))
+ presorted_cols = ((IncrementalSort *) plan)->presortedCols;
- cost_sort(&sort_path, root, NIL,
+ cost_sort(&sort_path, root, NIL, presorted_cols,
+ lefttree->startup_cost,
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
@@ -5526,13 +5559,31 @@ make_mergejoin(List *tlist,
* nullsFirst arrays already.
*/
static Sort *
-make_sort(Plan *lefttree, int numCols,
+make_sort(Plan *lefttree, int numCols, int presortedCols,
AttrNumber *sortColIdx, Oid *sortOperators,
Oid *collations, bool *nullsFirst)
{
- Sort *node = makeNode(Sort);
- Plan *plan = &node->plan;
+ Sort *node;
+ Plan *plan;
+
+ /* Always use a regular Sort node when enable_incrementalsort is false */
+ if (!enable_incrementalsort)
+ presortedCols = 0;
+
+ if (presortedCols == 0)
+ {
+ node = makeNode(Sort);
+ }
+ else
+ {
+ IncrementalSort *incrementalSort;
+
+ incrementalSort = makeNode(IncrementalSort);
+ node = &incrementalSort->sort;
+ incrementalSort->presortedCols = presortedCols;
+ }
+ plan = &node->plan;
plan->targetlist = lefttree->targetlist;
plan->qual = NIL;
plan->lefttree = lefttree;
@@ -5865,9 +5916,11 @@ find_ec_member_for_tle(EquivalenceClass *ec,
* 'lefttree' is the node which yields input tuples
* 'pathkeys' is the list of pathkeys by which the result is to be sorted
* 'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ * 'presortedCols' is the number of presorted columns in input tuples
*/
static Sort *
-make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
+make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
+ Relids relids, int presortedCols)
{
int numsortkeys;
AttrNumber *sortColIdx;
@@ -5887,7 +5940,7 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
&nullsFirst);
/* Now build the Sort node */
- return make_sort(lefttree, numsortkeys,
+ return make_sort(lefttree, numsortkeys, presortedCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
@@ -5930,7 +5983,7 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
numsortkeys++;
}
- return make_sort(lefttree, numsortkeys,
+ return make_sort(lefttree, numsortkeys, 0,
sortColIdx, sortOperators,
collations, nullsFirst);
}
@@ -5951,7 +6004,8 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
static Sort *
make_sort_from_groupcols(List *groupcls,
AttrNumber *grpColIdx,
- Plan *lefttree)
+ Plan *lefttree,
+ int presortedCols)
{
List *sub_tlist = lefttree->targetlist;
ListCell *l;
@@ -5984,7 +6038,7 @@ make_sort_from_groupcols(List *groupcls,
numsortkeys++;
}
- return make_sort(lefttree, numsortkeys,
+ return make_sort(lefttree, numsortkeys, presortedCols,
sortColIdx, sortOperators,
collations, nullsFirst);
}
@@ -6649,6 +6703,7 @@ is_projection_capable_plan(Plan *plan)
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 95cbffbd69..308f60beac 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -44,6 +44,7 @@
#include "parser/parse_clause.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
#include "utils/syscache.h"
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index de1257d9c2..496024cb16 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4650,13 +4650,13 @@ create_ordered_paths(PlannerInfo *root,
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
- bool is_sorted;
+ int n_useful_pathkeys;
- is_sorted = pathkeys_contained_in(root->sort_pathkeys,
- path->pathkeys);
- if (path == cheapest_input_path || is_sorted)
+ n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
+ path->pathkeys);
+ if (path == cheapest_input_path || n_useful_pathkeys > 0)
{
- if (!is_sorted)
+ if (n_useful_pathkeys < list_length(root->sort_pathkeys))
{
/* An explicit sort here can take advantage of LIMIT */
path = (Path *) create_sort_path(root,
@@ -5786,8 +5786,9 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
/* Estimate the cost of seq scan + sort */
seqScanPath = create_seqscan_path(root, rel, NULL, 0);
- cost_sort(&seqScanAndSortPath, root, NIL,
- seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
+ cost_sort(&seqScanAndSortPath, root, NIL, 0,
+ seqScanPath->startup_cost, seqScanPath->total_cost,
+ rel->tuples, rel->reltarget->width,
comparisonCost, maintenance_work_mem, -1.0);
/* Estimate the cost of index scan */
@@ -6023,14 +6024,14 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
foreach(lc, input_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
- bool is_sorted;
+ int n_useful_pathkeys;
- is_sorted = pathkeys_contained_in(root->group_pathkeys,
- path->pathkeys);
- if (path == cheapest_path || is_sorted)
+ n_useful_pathkeys = pathkeys_useful_for_ordering(
+ root->group_pathkeys, path->pathkeys);
+ if (path == cheapest_path || n_useful_pathkeys > 0)
{
/* Sort the cheapest-total path if it isn't already sorted */
- if (!is_sorted)
+ if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
@@ -6092,21 +6093,24 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
foreach(lc, partially_grouped_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
+ int n_useful_pathkeys;
/*
* Insert a Sort node, if required. But there's no point in
- * sorting anything but the cheapest path.
+ * non-incremental sorting anything but the cheapest path.
*/
- if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
- {
- if (path != partially_grouped_rel->cheapest_total_path)
- continue;
+ n_useful_pathkeys = pathkeys_useful_for_ordering(
+ root->group_pathkeys, path->pathkeys);
+ if (n_useful_pathkeys == 0 &&
+ path != partially_grouped_rel->cheapest_total_path)
+ continue;
+
+ if (n_useful_pathkeys < list_length(root->group_pathkeys))
path = (Path *) create_sort_path(root,
grouped_rel,
path,
root->group_pathkeys,
-1.0);
- }
if (parse->hasAggs)
add_path(grouped_rel, (Path *)
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 4617d12cb9..be520e6086 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 46367cba63..616ad1a474 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2782,6 +2782,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
case T_Hash:
case T_Material:
case T_Sort:
+ case T_IncrementalSort:
case T_Unique:
case T_SetOp:
case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index b586f941a8..3bce376e38 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -987,7 +987,8 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
sorted_p.startup_cost = input_path->startup_cost;
sorted_p.total_cost = input_path->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
- cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
+ cost_sort(&sorted_p, root, NIL, 0,
+ sorted_p.startup_cost, sorted_p.total_cost,
input_path->rows, input_path->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index fe3b4582d4..aa154b8905 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -105,7 +105,7 @@ compare_path_costs(Path *path1, Path *path2, CostSelector criterion)
}
/*
- * compare_path_fractional_costs
+ * compare_fractional_path_costs
* Return -1, 0, or +1 according as path1 is cheaper, the same cost,
* or more expensive than path2 for fetching the specified fraction
* of the total tuples.
@@ -1362,12 +1362,14 @@ create_merge_append_path(PlannerInfo *root,
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ int n_common_pathkeys;
pathnode->path.rows += subpath->rows;
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
- if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ if (pathkeys_common_contained_in(pathkeys, subpath->pathkeys,
+ &n_common_pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
@@ -1381,6 +1383,8 @@ create_merge_append_path(PlannerInfo *root,
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->parent->tuples,
subpath->pathtarget->width,
@@ -1628,7 +1632,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
/*
* Estimate cost for sort+unique implementation
*/
- cost_sort(&sort_path, root, NIL,
+ cost_sort(&sort_path, root, NIL, 0,
+ subpath->startup_cost,
subpath->total_cost,
rel->rows,
subpath->pathtarget->width,
@@ -1721,6 +1726,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
GatherMergePath *pathnode = makeNode(GatherMergePath);
Cost input_startup_cost = 0;
Cost input_total_cost = 0;
+ int n_common_pathkeys;
Assert(subpath->parallel_safe);
Assert(pathkeys);
@@ -1737,7 +1743,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
pathnode->path.pathtarget = target ? target : rel->reltarget;
pathnode->path.rows += subpath->rows;
- if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ if (pathkeys_common_contained_in(pathkeys, subpath->pathkeys, &n_common_pathkeys))
{
/* Subpath is adequately ordered, we won't need to sort it */
input_startup_cost += subpath->startup_cost;
@@ -1751,6 +1757,8 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
cost_sort(&sort_path,
root,
pathkeys,
+ n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
@@ -2610,9 +2618,35 @@ create_sort_path(PlannerInfo *root,
List *pathkeys,
double limit_tuples)
{
- SortPath *pathnode = makeNode(SortPath);
+ SortPath *pathnode;
+ int n_common_pathkeys;
+
+ /*
+ * Use an incremental sort when it's enabled and there are common
+ * pathkeys; otherwise use a regular sort.
+ */
+ if (enable_incrementalsort)
+ n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+ else
+ n_common_pathkeys = 0;
+
+ if (n_common_pathkeys == 0)
+ {
+ pathnode = makeNode(SortPath);
+ pathnode->path.pathtype = T_Sort;
+ }
+ else
+ {
+ IncrementalSortPath *incpathnode;
+
+ incpathnode = makeNode(IncrementalSortPath);
+ pathnode = &incpathnode->spath;
+ pathnode->path.pathtype = T_IncrementalSort;
+ incpathnode->presortedCols = n_common_pathkeys;
+ }
+
+ Assert(n_common_pathkeys < list_length(pathkeys));
- pathnode->path.pathtype = T_Sort;
pathnode->path.parent = rel;
/* Sort doesn't project, so use source path's pathtarget */
pathnode->path.pathtarget = subpath->pathtarget;
@@ -2626,7 +2660,9 @@ create_sort_path(PlannerInfo *root,
pathnode->subpath = subpath;
- cost_sort(&pathnode->path, root, pathkeys,
+ cost_sort(&pathnode->path, root,
+ pathkeys, n_common_pathkeys,
+ subpath->startup_cost,
subpath->total_cost,
subpath->rows,
subpath->pathtarget->width,
@@ -2938,7 +2974,8 @@ create_groupingsets_path(PlannerInfo *root,
else
{
/* Account for cost of sort, but don't charge input cost again */
- cost_sort(&sort_path, root, NIL,
+ cost_sort(&sort_path, root, NIL, 0,
+ 0.0,
0.0,
subpath->rows,
subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 50b34fcbc6..0b5ce4be45 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -295,7 +295,8 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
qstate->sortNullsFirsts,
work_mem,
NULL,
- qstate->rescan_needed);
+ qstate->rescan_needed,
+ false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index bf240aa9c5..b694a5828d 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3716,6 +3716,42 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
return numdistinct;
}
+/*
+ * estimate_pathkeys_groups - Estimate the number of groups the dataset is
+ * divided into by each prefix of pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * that the first (i + 1) pathkeys divide the dataset into. This is a
+ * convenience wrapper over estimate_num_groups().
+ */
+double *
+estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+{
+ ListCell *l;
+ List *groupExprs = NIL;
+ double *result;
+ int i;
+
+ /*
+ * Get number of groups for each prefix of pathkeys.
+ */
+ i = 0;
+ result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ foreach(l, pathkeys)
+ {
+ PathKey *key = (PathKey *)lfirst(l);
+ EquivalenceMember *member = (EquivalenceMember *)
+ linitial(key->pk_eclass->ec_members);
+
+ groupExprs = lappend(groupExprs, member->em_expr);
+
+ result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ i++;
+ }
+
+ return result;
+}
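A deliberately simplified stand-in for the wrapper above: cumulative group counts built from per-column distinct-value estimates, capped at the row count, rather than from real planner statistics (all names here are invented for the sketch):

```c
#include <assert.h>

/*
 * Toy estimate_pathkeys_groups(): result[i] is the estimated number of
 * groups the first i + 1 sort columns divide the dataset into, modelled
 * as the product of per-column ndistinct, capped at the tuple count.
 */
static void
toy_pathkeys_groups(const double *ndistinct, int nkeys, double tuples,
					double *result)
{
	double		groups = 1.0;
	int			i;

	for (i = 0; i < nkeys; i++)
	{
		groups *= ndistinct[i];
		if (groups > tuples)
			groups = tuples;
		result[i] = groups;
	}
}
```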
+
/*
* Estimate hash bucket statistics when the specified expression is used
* as a hash key for the given number of buckets.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1db7845d5a..44a30c2430 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -859,6 +859,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of incremental sort steps."),
+ NULL
+ },
+ &enable_incrementalsort,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 041bdc2fa7..26263ab5e6 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,9 @@
#define PARALLEL_SORT(state) ((state)->shared == NULL ? 0 : \
(state)->worker >= 0 ? 1 : 2)
+#define INITAL_MEMTUPSIZE Max(1024, \
+ ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
/* GUC variables */
#ifdef TRACE_SORT
bool trace_sort = false;
@@ -243,6 +246,13 @@ struct Tuplesortstate
int64 allowedMem; /* total memory allowed, in bytes */
int maxTapes; /* number of tapes (Knuth's T) */
int tapeRange; /* maxTapes-1 (Knuth's P) */
+ int64 maxSpace; /* maximum amount of space occupied across group
+ sorts, either in-memory or on-disk */
+ bool maxSpaceOnDisk; /* true when maxSpace is value for on-disk
+ space, false when it's value for in-memory
+ space */
+ TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ MemoryContext maincontext;
MemoryContext sortcontext; /* memory context holding most sort data */
MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
@@ -647,6 +657,9 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
static void worker_nomergeruns(Tuplesortstate *state);
static void leader_takeover_tapes(Tuplesortstate *state);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state, bool delete);
+static void tuplesort_updatemax(Tuplesortstate *state);
+
/*
* Special versions of qsort just for SortTuple objects. qsort_tuple() sorts
@@ -682,6 +695,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
bool randomAccess)
{
Tuplesortstate *state;
+ MemoryContext maincontext;
MemoryContext sortcontext;
MemoryContext tuplecontext;
MemoryContext oldcontext;
@@ -691,13 +705,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
elog(ERROR, "random access disallowed under parallel sort");
/*
- * Create a working memory context for this sort operation. All data
- * needed by the sort will live inside this context.
+ * Create a memory context that survives tuplesort_reset. It holds data
+ * that is worth keeping while sorting multiple similar batches.
*/
- sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+ maincontext = AllocSetContextCreate(CurrentMemoryContext,
"TupleSort main",
ALLOCSET_DEFAULT_SIZES);
+ /*
+ * Create a working memory context for a single sort operation. The
+ * contents of this context are deleted by tuplesort_reset.
+ */
+ sortcontext = AllocSetContextCreate(maincontext,
+ "TupleSort sort",
+ ALLOCSET_DEFAULT_SIZES);
+
/*
* Caller tuple (e.g. IndexTuple) memory context.
*
@@ -715,7 +737,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
* Make the Tuplesortstate within the per-sort context. This way, we
* don't need a separate pfree() operation for it at shutdown.
*/
- oldcontext = MemoryContextSwitchTo(sortcontext);
+ oldcontext = MemoryContextSwitchTo(maincontext);
state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
@@ -740,6 +762,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
state->availMem = state->allowedMem;
state->sortcontext = sortcontext;
state->tuplecontext = tuplecontext;
+ state->maincontext = maincontext;
state->tapeset = NULL;
state->memtupcount = 0;
@@ -748,9 +771,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
* Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
* see comments in grow_memtuples().
*/
- state->memtupsize = Max(1024,
- ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+ state->memtupsize = INITAL_MEMTUPSIZE;
state->growmemtuples = true;
state->slabAllocatorUsed = false;
state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -807,14 +828,15 @@ tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
- int workMem, SortCoordinate coordinate, bool randomAccess)
+ int workMem, SortCoordinate coordinate,
+ bool randomAccess, bool skipAbbrev)
{
Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
randomAccess);
MemoryContext oldcontext;
int i;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
AssertArg(nkeys > 0);
@@ -857,7 +879,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
sortKey->ssup_nulls_first = nullsFirstFlags[i];
sortKey->ssup_attno = attNums[i];
/* Convey if abbreviation optimization is applicable in principle */
- sortKey->abbreviate = (i == 0);
+ sortKey->abbreviate = (i == 0) && !skipAbbrev;
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
@@ -890,7 +912,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -985,7 +1007,7 @@ tuplesort_begin_index_btree(Relation heapRel,
MemoryContext oldcontext;
int i;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -1064,7 +1086,7 @@ tuplesort_begin_index_hash(Relation heapRel,
randomAccess);
MemoryContext oldcontext;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -1107,7 +1129,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
int16 typlen;
bool typbyval;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ oldcontext = MemoryContextSwitchTo(state->maincontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -1224,16 +1246,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
}
/*
- * tuplesort_end
+ * tuplesort_free
*
- * Release resources and clean up.
- *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage. Be careful not to attempt to use or free such
- * pointers afterwards!
+ * Internal routine for freeing resources of tuplesort.
*/
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state, bool delete)
{
/* context swap probably not needed, but let's be safe */
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1312,110 @@ tuplesort_end(Tuplesortstate *state)
* Free the per-sort memory context, thereby releasing all working memory,
* including the Tuplesortstate struct itself.
*/
- MemoryContextDelete(state->sortcontext);
+ if (delete)
+ {
+ MemoryContextDelete(state->maincontext);
+ }
+ else
+ {
+ MemoryContextResetOnly(state->sortcontext);
+ MemoryContextResetOnly(state->tuplecontext);
+ }
+}
+
+/*
+ * tuplesort_end
+ *
+ * Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage. Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+ tuplesort_free(state, true);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ * Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+ int64 spaceUsed;
+ bool spaceUsedOnDisk;
+
+ /*
+ * Note: it might seem we should provide both memory and disk usage for a
+ * disk-based sort. However, the current code doesn't track memory space
+ * accurately once we have begun to return tuples to the caller (since we
+ * don't account for pfree's the caller is expected to do), so we cannot
+ * rely on availMem in a disk sort. This does not seem worth the overhead
+ * to fix. Is it worth creating an API for the memory context code to
+ * tell us how much is actually used in sortcontext?
+ */
+ if (state->tapeset)
+ {
+ spaceUsedOnDisk = true;
+ spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+ }
+ else
+ {
+ spaceUsedOnDisk = false;
+ spaceUsed = state->allowedMem - state->availMem;
+ }
+
+ /* A disk sort always counts as using more space than an in-memory sort */
+ if (spaceUsedOnDisk > state->maxSpaceOnDisk ||
+ (spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+ {
+ state->maxSpace = spaceUsed;
+ state->maxSpaceOnDisk = spaceUsedOnDisk;
+ state->maxSpaceStatus = state->status;
+ }
+}
+
+/*
+ * tuplesort_reset
+ *
+ * Reset the tuplesort. Reset all the data in the tuplesort, but leave the
+ * meta-information in. After tuplesort_reset, tuplesort is ready to start
+ * a new sort. This allows avoiding recreation of the tuplesort (and saves
+ * resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+ tuplesort_updatemax(state);
+ tuplesort_free(state, false);
+ state->status = TSS_INITIAL;
+ state->memtupcount = 0;
+ state->boundUsed = false;
+ state->tapeset = NULL;
+ state->currentRun = 0;
+ state->result_tape = -1;
+ state->bounded = false;
+ state->availMem = state->allowedMem;
+ state->lastReturnedTuple = NULL;
+ state->slabAllocatorUsed = false;
+ state->slabMemoryBegin = NULL;
+ state->slabMemoryEnd = NULL;
+ state->slabFreeHead = NULL;
+ state->growmemtuples = true;
+
+ if (state->memtupsize < INITAL_MEMTUPSIZE)
+ {
+ if (state->memtuples)
+ pfree(state->memtuples);
+ state->memtuples = (SortTuple *) palloc(INITAL_MEMTUPSIZE * sizeof(SortTuple));
+ state->memtupsize = INITAL_MEMTUPSIZE;
+ }
+
+ USEMEM(state, GetMemoryChunkSpace(state->memtuples));
}
/*
@@ -2589,8 +2710,7 @@ mergeruns(Tuplesortstate *state)
* Reset tuple memory. We've freed all the tuples that we previously
* allocated. We will use the slab allocator from now on.
*/
- MemoryContextDelete(state->tuplecontext);
- state->tuplecontext = NULL;
+ MemoryContextResetOnly(state->tuplecontext);
/*
* We no longer need a large memtuples array. (We will allocate a smaller
@@ -2640,7 +2760,8 @@ mergeruns(Tuplesortstate *state)
* from each input tape.
*/
state->memtupsize = numInputTapes;
- state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+ state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+ numInputTapes * sizeof(SortTuple));
USEMEM(state, GetMemoryChunkSpace(state->memtuples));
/*
@@ -3137,18 +3258,15 @@ tuplesort_get_stats(Tuplesortstate *state,
* to fix. Is it worth creating an API for the memory context code to
* tell us how much is actually used in sortcontext?
*/
- if (state->tapeset)
- {
+ tuplesort_updatemax(state);
+
+ if (state->maxSpaceOnDisk)
stats->spaceType = SORT_SPACE_TYPE_DISK;
- stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- }
else
- {
stats->spaceType = SORT_SPACE_TYPE_MEMORY;
- stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
- }
+ stats->spaceUsed = (state->maxSpace + 1023) / 1024;
- switch (state->status)
+ switch (state->maxSpaceStatus)
{
case TSS_SORTEDINMEM:
if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a953820f43..fb1e336b9d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1764,6 +1764,20 @@ typedef struct MaterialState
Tuplestorestate *tuplestorestate;
} MaterialState;
+
+/* ----------------
+ * When sorting by multiple keys, the input dataset may already be presorted
+ * by some prefix of those keys. We call these "presorted keys".
+ * PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+ FmgrInfo flinfo; /* comparison function info */
+ FunctionCallInfoData fcinfo; /* comparison function call info */
+ OffsetNumber attno; /* attribute number in tuple */
+} PresortedKeyData;
+
/* ----------------
* Shared memory container for per-worker sort information
* ----------------
@@ -1792,6 +1806,45 @@ typedef struct SortState
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
+/* ----------------
+ * Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+ TuplesortInstrumentation sinstrument;
+ int64 groupsCount;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+ int num_workers;
+ IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ * IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+ ScanState ss; /* its first field is NodeTag */
+ bool bounded; /* is the result set bounded? */
+ int64 bound; /* if bounded, how many tuples are needed */
+ bool sort_Done; /* sort completed yet? */
+ bool finished; /* is fetching tuples from the
+ outer node finished? */
+ bool bounded_Done; /* value of bounded we did the sort with */
+ int64 bound_Done; /* value of bound we did the sort with */
+ void *tuplesortstate; /* private state of tuplesort.c */
+ PresortedKeyData *presortedKeys; /* keys the dataset is presorted by */
+ int64 groupsCount; /* number of groups with equal presorted keys */
+ /* slot for pivot tuple defining values of presorted keys within group */
+ TupleTableSlot *grpPivotSlot;
+ bool am_worker; /* are we a worker? */
+ SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
/* ---------------------
* GroupState information
* ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 74b094a9c3..133bb17bdc 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -73,6 +73,7 @@ typedef enum NodeTag
T_HashJoin,
T_Material,
T_Sort,
+ T_IncrementalSort,
T_Group,
T_Agg,
T_WindowAgg,
@@ -125,6 +126,7 @@ typedef enum NodeTag
T_HashJoinState,
T_MaterialState,
T_SortState,
+ T_IncrementalSortState,
T_GroupState,
T_AggState,
T_WindowAggState,
@@ -240,6 +242,7 @@ typedef enum NodeTag
T_ProjectionPath,
T_ProjectSetPath,
T_SortPath,
+ T_IncrementalSortPath,
T_GroupPath,
T_UpperUniquePath,
T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f2e19eae68..13d9a75b50 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -751,6 +751,17 @@ typedef struct Sort
bool *nullsFirst; /* NULLS FIRST/LAST directions */
} Sort;
+
+/* ----------------
+ * incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+ Sort sort;
+ int presortedCols; /* number of presorted columns */
+} IncrementalSort;
+
/* ---------------
* group node -
* Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index d576aa7350..5b0c63add9 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1519,6 +1519,16 @@ typedef struct SortPath
Path *subpath; /* path representing input source */
} SortPath;
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+ SortPath spath;
+ int presortedCols; /* number of presorted columns */
+} IncrementalSortPath;
+
+
/*
* GroupPath represents grouping (of presorted input)
*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 132e35551b..00f0205be4 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
extern PGDLLIMPORT bool enable_hashagg;
extern PGDLLIMPORT bool enable_nestloop;
extern PGDLLIMPORT bool enable_material;
@@ -105,8 +106,9 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
RelOptInfo *baserel, ParamPathInfo *param_info);
extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
- List *pathkeys, Cost input_cost, double tuples, int width,
- Cost comparison_cost, int sort_mem,
+ List *pathkeys, int presorted_keys,
+ Cost input_startup_cost, Cost input_total_cost,
+ double tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_append(AppendPath *path);
extern void cost_merge_append(Path *path, PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 94f9bb2b57..597c5052a9 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -189,6 +189,8 @@ typedef enum
extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
Relids required_outer,
CostSelector cost_criterion,
@@ -229,6 +231,7 @@ extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
extern List *trim_mergeclauses_for_inner_pathkeys(PlannerInfo *root,
List *mergeclauses,
List *pathkeys);
+extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
extern List *truncate_useless_pathkeys(PlannerInfo *root,
RelOptInfo *rel,
List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 299c9f846a..43e8ef20dc 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -206,6 +206,9 @@ extern void mergejoinscansel(PlannerInfo *root, Node *clause,
extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows, List **pgset);
+extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ double tuples);
+
extern void estimate_hash_bucket_stats(PlannerInfo *root,
Node *hashkey, double nbuckets,
Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index d2e6754f04..eb260dfd8b 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -193,7 +193,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
int workMem, SortCoordinate coordinate,
- bool randomAccess);
+ bool randomAccess, bool skipAbbrev);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel, int workMem,
SortCoordinate coordinate, bool randomAccess);
@@ -240,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
extern void tuplesort_end(Tuplesortstate *state);
+extern void tuplesort_reset(Tuplesortstate *state);
+
extern void tuplesort_get_stats(Tuplesortstate *state,
TuplesortInstrumentation *stats);
extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
QUERY PLAN
-Sort
+Incremental Sort
Sort Key: id, data
- -> Seq Scan on test_dc
+ Presorted Key: id
+ -> Index Scan using test_dc_pkey on test_dc
Filter: ((data)::text = '34'::text)
step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
id data
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index a79f891da7..0926650a0f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1517,6 +1517,7 @@ NOTICE: drop cascades to table matest1
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
SELECT thousand, tenthous FROM tenk1
@@ -1657,9 +1658,45 @@ FROM generate_series(1, 3) g(i);
{3,7,8,10,13,13,16,18,19,22}
(3 rows)
+set enable_incrementalsort = on;
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Merge Append
+ Sort Key: tenk1.thousand, tenk1.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1
+ -> Incremental Sort
+ Sort Key: tenk1_1.thousand, tenk1_1.thousand
+ Presorted Key: tenk1_1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+(7 rows)
+
+explain (costs off)
+SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+ QUERY PLAN
+-------------------------------------------------------------
+ Merge Append
+ Sort Key: a.thousand, a.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 a
+ -> Incremental Sort
+ Sort Key: b.unique2, b.unique2
+ Presorted Key: b.unique2
+ -> Index Only Scan using tenk1_unique2 on tenk1 b
+(7 rows)
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 4d5931d67e..cec3b22fb5 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2347,18 +2347,21 @@ select count(*) from
left join
(select * from tenk1 y order by y.unique2) y
on x.thousand = y.unique2 and x.twothousand = y.hundred and x.fivethous = y.unique2;
- QUERY PLAN
-----------------------------------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------------------------------------------------------
Aggregate
-> Merge Left Join
- Merge Cond: (x.thousand = y.unique2)
- Join Filter: ((x.twothousand = y.hundred) AND (x.fivethous = y.unique2))
+ Merge Cond: ((x.thousand = y.unique2) AND (x.twothousand = y.hundred) AND (x.fivethous = y.unique2))
-> Sort
Sort Key: x.thousand, x.twothousand, x.fivethous
-> Seq Scan on tenk1 x
-> Materialize
- -> Index Scan using tenk1_unique2 on tenk1 y
-(9 rows)
+ -> Incremental Sort
+ Sort Key: y.unique2, y.hundred
+ Presorted Key: y.unique2
+ -> Subquery Scan on y
+ -> Index Scan using tenk1_unique2 on tenk1 y_1
+(12 rows)
select count(*) from
(select * from tenk1 x order by x.thousand, x.twothousand, x.fivethous) x
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 4fccd9ae54..e0290977f1 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -935,10 +935,12 @@ EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b) RIGHT JOIN prt1_e t3 ON (t1.a = (t3.a + t3.b)/2) WHERE t3.c = 0 ORDER BY t1.a, t2.b, t3.a + t3.b;
QUERY PLAN
----------------------------------------------------------------------------------
- Sort
+ Incremental Sort
Sort Key: t1.a, t2.b, ((t3.a + t3.b))
+ Presorted Key: t1.a
-> Result
- -> Append
+ -> Merge Append
+ Sort Key: t1.a
-> Merge Left Join
Merge Cond: (t1.a = t2.b)
-> Sort
@@ -987,7 +989,7 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2
-> Sort
Sort Key: t2_2.b
-> Seq Scan on prt2_p3 t2_2
-(52 rows)
+(54 rows)
SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b) RIGHT JOIN prt1_e t3 ON (t1.a = (t3.a + t3.b)/2) WHERE t3.c = 0 ORDER BY t1.a, t2.b, t3.a + t3.b;
a | c | b | c | ?column? | c
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 759f7d9d59..f855214099 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_gathermerge | on
enable_hashagg | on
enable_hashjoin | on
+ enable_incrementalsort | on
enable_indexonlyscan | on
enable_indexscan | on
enable_material | on
@@ -87,7 +88,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(15 rows)
+(16 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 2e42ae115d..7229997144 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -546,6 +546,7 @@ drop table matest0 cascade;
set enable_seqscan = off;
set enable_indexscan = on;
set enable_bitmapscan = off;
+set enable_incrementalsort = off;
-- Check handling of duplicated, constant, or volatile targetlist items
explain (costs off)
@@ -607,9 +608,26 @@ SELECT
ORDER BY f.i LIMIT 10)
FROM generate_series(1, 3) g(i);
+set enable_incrementalsort = on;
+
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+
+explain (costs off)
+SELECT x, y FROM
+ (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+ UNION ALL
+ SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+
reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+reset enable_incrementalsort;
--
-- Check that constraint exclusion works correctly with partitions using
On 03/05/2018 11:07 PM, Alexander Korotkov wrote:
Hi!
Thank you for reviewing this patch!
Revised version is attached.
OK, the revised patch works fine - I've done a lot of testing and
benchmarking, and not a single segfault or any other crash.
Regarding the benchmarks, I generally used queries of the form
SELECT * FROM (SELECT * FROM t ORDER BY a) foo ORDER BY a,b
with the first sort done in various ways:
* regular Sort node
* indexes with Index Scan
* indexes with Index Only Scan
and all these three options with and without LIMIT (the limit was set to
1% of the source table).
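As a side note for readers of the archive: the strategy these queries exercise can be sketched as a toy Python model (purely illustrative, not the patch's C implementation) — rows already sorted on a prefix of the keys are grouped on that prefix, and each group is sorted on the full key; because groups are emitted one at a time, a LIMIT consumer stops after sorting only the first few small groups.

```python
from itertools import groupby, islice
from operator import itemgetter

def incremental_sort(rows, prefix_len):
    """Toy model: rows arrive presorted on the first prefix_len columns.
    Sort each equal-prefix group on the full tuple; as a generator, a
    LIMIT consumer stops after the first few (small) group sorts."""
    key = itemgetter(*range(prefix_len))
    for _, group in groupby(rows, key=key):
        yield from sorted(group)  # full-key sort within one group

rows = [(1, 9), (1, 2), (2, 5), (2, 1), (3, 7)]  # presorted on column 0
print(list(islice(incremental_sort(rows, 1), 3)))  # [(1, 2), (1, 9), (2, 1)]
```

With LIMIT 3 only the first two groups are ever sorted; a regular sort would have to order all five rows first, which is where the up-to-~95% wins in the LIMIT benchmarks come from.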
I've also varied parallelism (max_parallel_workers_per_gather was set to
either 0 or 2), work_mem (from 4MB to 256MB) and data set size (tables
from 1000 rows to 10M rows).
All of this may seem like overkill, but I've found a couple of
regressions thanks to that.
The full scripts and results are available here:
https://github.com/tvondra/incremental-sort-tests
The queries actually executed are a bit more complicated, to eliminate
overhead due to data transfer to the client, etc. The same approach was used
in the other sorting benchmarks we've done in the past.
I'm attaching results for two scales - 10k and 10M rows, preprocessed
into .ods format. I haven't looked at the other scales yet, but I don't
expect any surprises there.
Each .ods file contains raw data for one of the tests (matching the .sh
script filename), pivot table, and comparison of durations with and
without the incremental sort.
In general, I think the results look pretty impressive. Almost all the
comparisons are green, which means "faster than master" - usually by
tens of percent (without limit), or by up to ~95% (with LIMIT).
There are a couple of regressions in two cases: sort-indexes and
sort-indexes-ios.
On the small dataset this seems to be related to the number of groups
(essentially, number of distinct values in a column). My assumption is
that there is some additional overhead when "switching" between the
groups, and with many groups it's significant enough to affect results
on these tiny tables (where master only takes ~3ms to do the sort). The
slowdown seems to be
On the large data set it seems to be somehow related to both work_mem
and number of groups, but I didn't have time to investigate that yet
(there are explain analyze plans in the results, so feel free to look).
In general, I think this looks really nice. It's certainly awesome with
the LIMIT case, as it allows us to leverage indexes on a subset of the
ORDER BY columns.
Now, there's a caveat in those tests - the data set is synthetic and
perfectly random, i.e. all groups equally likely, no correlations or
anything like that.
I wonder what is the "worst case" scenario, i.e. how to construct a data
set with particularly bad behavior of the incremental sort.
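One hypothetical candidate (an assumption on my part, not something measured here) is a nearly-unique leading column: every "group" then holds a single tuple, so the per-group switching/bookkeeping cost is paid once per row. A toy count of how many per-group sorts would run:

```python
from itertools import groupby
from operator import itemgetter

def count_group_sorts(rows, prefix_len):
    """How many per-group sorts an incremental sort would perform."""
    key = itemgetter(*range(prefix_len))
    return sum(1 for _ in groupby(rows, key=key))

n = 10000
hundred_per_group = [(i // 100, i) for i in range(n)]  # 100 rows per group
unique_prefix = [(i, i) for i in range(n)]             # 1 row per group
print(count_group_sorts(hundred_per_group, 1))  # 100
print(count_group_sorts(unique_prefix, 1))      # 10000
```

In the degenerate case the sort does n trivial one-row "sorts", so any fixed per-group overhead scales linearly with the row count while buying nothing over a plain sort.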
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
sort-10000.tgz (application/x-compressed-tar)