Index Skip Scan
Hi all,
I would like to start a discussion on Index Skip Scan, referred to as
Loose Index Scan in the wiki [1].
My use-case is the simplest form of Index Skip Scan (B-Tree only),
namely going from
CREATE TABLE t1 (a integer PRIMARY KEY, b integer);
CREATE INDEX idx_t1_b ON t1 (b);
INSERT INTO t1 (SELECT i, i % 3 FROM generate_series(1, 10000000) as i);
ANALYZE;
EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT DISTINCT b FROM t1;
HashAggregate (cost=169247.71..169247.74 rows=3 width=4) (actual
time=4104.099..4104.099 rows=3 loops=1)
Output: b
Group Key: t1.b
Buffers: shared hit=44248
-> Seq Scan on public.t1 (cost=0.00..144247.77 rows=9999977
width=4) (actual time=0.059..1050.376 rows=10000000 loops=1)
Output: a, b
Buffers: shared hit=44248
Planning Time: 0.157 ms
Execution Time: 4104.155 ms
(9 rows)
to
CREATE TABLE t1 (a integer PRIMARY KEY, b integer);
CREATE INDEX idx_t1_b ON t1 (b);
INSERT INTO t1 (SELECT i, i % 3 FROM generate_series(1, 10000000) as i);
ANALYZE;
EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT DISTINCT b FROM t1;
Index Skip Scan using idx_t1_b on public.t1 (cost=0.43..1.30 rows=3
width=4) (actual time=0.061..0.137 rows=3 loops=1)
Output: b
Heap Fetches: 3
Buffers: shared hit=13
Planning Time: 0.155 ms
Execution Time: 0.170 ms
(6 rows)
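For intuition about where the speedup comes from: a skip scan is essentially a repeated B-tree descent, probing once per distinct value instead of reading every tuple and de-duplicating afterwards. A rough Python sketch of the idea, using a sorted list in place of the index (all names here are illustrative, not from the patch):

```python
from bisect import bisect_right

def distinct_via_skip(index):
    """Return the distinct leading values of a sorted sequence by
    repeatedly skipping past the current value, the way a B-tree
    skip scan would, instead of visiting every entry."""
    result = []
    pos = 0
    while pos < len(index):
        value = index[pos]
        result.append(value)
        # the "amskip" step: binary-search past all entries equal to value
        pos = bisect_right(index, value, pos)
    return result

# With b = i % 3 the scan needs only 3 probes, however many rows exist.
index = sorted(i % 3 for i in range(30))
print(distinct_via_skip(index))  # [0, 1, 2]
```

With only 3 distinct values of b among 10M rows, this is why the buffer count drops from 44248 shared hits to 13.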
I took Thomas Munro's previous patch [2] on the subject, added a GUC, a
test case, documentation hooks, and minor code cleanups, and made the
patch pass an --enable-cassert make check-world run. So, the overall
design is the same.
However, as Robert Haas noted in that thread, there are issues with the
patch as is, especially in relation to the amcanbackward functionality.
A couple of questions to begin with.
Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should
a new node (T_IndexSkipScan) be created? If the latter, there will
likely be functionality that needs to be refactored into code shared
between the nodes.
What is the best way to deal with the amcanbackward functionality? Do
people see an alternative to Robert's idea of adding a flag to the
scan?
I wasn't planning on making this a patch submission for the July
CommitFest due to the reasons mentioned above, but can do so if people
think it is best. The patch is based on master/4c8156.
Any feedback, suggestions, design ideas and help with the patch in
general is greatly appreciated.
Thanks in advance!
[1]: https://wiki.postgresql.org/wiki/Loose_indexscan
[2]: /messages/by-id/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw@mail.gmail.com
Best regards,
Jesper
Attachments:
wip_indexskipscan.patch (text/x-patch)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6b2b9e3742..74ed15bfeb 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b60240ecfe..d03d64a4bc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3769,6 +3769,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 24c3405f91..bd6a2a7b93 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -664,6 +665,14 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ TODO
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 14a1aa56cb..5fd9f81f23 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1312,6 +1312,22 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
such cases and allow index-only scans to be generated, but older versions
will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ TODO
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e95fbbcea7..85d6571c6d 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -106,6 +106,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 0a32182dd7..162639090d 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..ecd4af49d8 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0002df30c0..7120950868 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 22b5cc921f..f9451768c8 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -791,6 +792,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_amroutine->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_amroutine->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 27a3032e42..77d9036d1c 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -130,6 +130,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -378,6 +379,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -445,6 +448,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 0bcfa10b86..15fabb7c8b 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1163,6 +1163,121 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /* Release the current associated buffer */
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(scan->indexRelation, scan->xs_itup);
+ }
+ else
+ {
+ Relation rel;
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ rel = scan->indexRelation;
+ itupdesc = RelationGetDescr(rel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) | (rel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ so->skipScanKey[i].sk_flags = flags;
+ so->skipScanKey[i].sk_argument = datum;
+ }
+ }
+
+ /* Use _bt_search and _bt_binsrch to get the buffer and offset number */
+ stack = _bt_search(scan->indexRelation, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir), &buf, BT_READ,
+ scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ if (ScanDirectionIsForward(dir))
+ {
+ so->currPos.moreLeft = false;
+ so->currPos.moreRight = true;
+
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+ else
+ {
+ so->currPos.moreLeft = true;
+ so->currPos.moreRight = false;
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ _bt_freeskey(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 4a9b5da268..f30e519bb5 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -59,6 +59,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 73d94b7235..2a5cbbf5c2 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -993,7 +993,13 @@ ExplainNode(PlanState *planstate, List *ancestors,
pname = sname = "Index Scan";
break;
case T_IndexOnlyScan:
- pname = sname = "Index Only Scan";
+ {
+ IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->distinctPrefix > 0)
+ pname = sname = "Index Skip Scan";
+ else
+ pname = sname = "Index Only Scan";
+ }
break;
case T_BitmapIndexScan:
pname = sname = "Bitmap Index Scan";
@@ -1222,6 +1228,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->distinctPrefix > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->distinctPrefix,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 3a02a99621..a6a4e05ec8 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -112,6 +112,19 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
+ {
+ /* Reached end of index. */
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -247,6 +260,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -509,6 +524,9 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_HeapFetches = 0;
+ indexstate->ioss_NumDistinctKeys = node->distinctPrefix;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 1c12075b01..cb7d85da5f 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -516,6 +516,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(distinctPrefix);
return newnode;
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a2a7e0c520..76f4dac0b5 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -125,6 +125,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index cf82b7052d..e95676c7f2 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -183,7 +183,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2775,7 +2776,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -5163,7 +5165,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipprefix)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5178,6 +5181,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->distinctPrefix = skipprefix;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 67a2c7a581..20484ec3c5 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4711,6 +4711,17 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath) &&
+ path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ ((IndexPath *) path)->indexinfo->amcanskip)
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root, distinct_rel,
+ path,
+ list_length(root->distinct_pathkeys),
+ numDistinctRows));
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e190ad49d1..77716f661a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2768,6 +2768,45 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(subpath, IndexPath));
+
+ /* We don't want to modify subpath, so make a copy. */
+ memcpy(pathnode, subpath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(numCols > 0);
+ pathnode->indexskipprefix = numCols;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = subpath->startup_cost;
+ pathnode->path.total_cost = subpath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 8369e3ad62..f07aba1c9c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -270,6 +270,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fa3c8a7905..8905fe44dd 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -844,6 +844,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index f43086f6d0..b5f397e12f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -298,6 +298,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 14526a6bb2..81e1ea5d5f 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -127,6 +127,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -221,6 +225,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 24c720bf42..4c260115b4 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -170,6 +170,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 04ecb4cbc0..6009edb22d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -471,6 +471,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ ScanKey skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -571,6 +574,7 @@ extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -598,6 +602,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index da7f52cab0..eeff29937b 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1313,6 +1313,9 @@ typedef struct IndexScanState
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
* VMBuffer buffer in use for visibility map testing, if any
+ * HeapFetches number of tuples we were forced to fetch from heap
+ * NumDistinctKeys number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ioss_PscanLen Size of parallel index-only scan descriptor
* ----------------
*/
@@ -1331,6 +1334,9 @@ typedef struct IndexOnlyScanState
Relation ioss_RelationDesc;
IndexScanDesc ioss_ScanDesc;
Buffer ioss_VMBuffer;
+ long ioss_HeapFetches;
+ int ioss_NumDistinctKeys;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 5201c6d4bc..773c238ada 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -434,6 +434,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int distinctPrefix; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 5af484024a..01df4e0486 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -797,6 +797,7 @@ typedef struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1147,6 +1148,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1161,6 +1165,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 77ca7ff837..1fb3de6fa6 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -58,6 +58,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e99ae36bef..3cd1b8b7ac 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -186,6 +186,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index fc81088d4b..b66481fb24 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..f443ef4623 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,21 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------
+ Index Skip Scan using tenk1_four on public.tenk1
+ Output: four
+(2 rows)
+
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index f9e7118f0d..3a7d59243c 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..db222fa510 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,9 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
--
2.17.1
Hi!
On Mon, Jun 18, 2018 at 6:26 PM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
I would like to start a discussion on Index Skip Scan referred to as
Loose Index Scan in the wiki [1].
Great, I'm glad to see you working on this!
However, as Robert Haas noted in the thread there are issues with the
patch as is, especially in relationship to the amcanbackward functionality.
A couple of questions to begin with.
Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should
a new node (T_IndexSkipScan) be created? If the latter, then there likely
will be functionality that needs to be refactored into shared code
between the nodes.
Is skip scan only possible for index-only scan? I guess not. We
could also make a plain index scan behave like a skip scan. And it
should be useful for accelerating the DISTINCT ON clause. Thus, we might
have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
IndexOnlySkipScan. So, I don't like the idea of index scan nodes
multiplying this way, and it would probably be better to keep the
number of nodes smaller. But I don't insist on that, and I would like
to hear other opinions too.
Which is the best way to deal with the amcanbackward functionality? Do
people see another alternative to Robert's idea of adding a flag to the
scan?
Supporting amcanbackward seems to be basically possible, but rather
complicated and not very efficient. So, it seems not worth
implementing, at least in the initial version. And then the question
is how an index access method should report that it supports both
skip scan and backward scan, but not both together. What about turning
amcanbackward into a function which takes a (bool skipscan) argument?
Then this function would return whether backward scan is
supported depending on whether skip scan is used.
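A minimal sketch of that idea in C (the typedef and function names are purely illustrative, not actual PostgreSQL code):

```c
/* Sketch of the suggestion above: report backward-scan support as a
 * function of whether a skip scan is in use, instead of a fixed
 * amcanbackward flag. Names and signature are illustrative only. */
#include <stdbool.h>

typedef bool (*amcanbackward_function) (bool skipscan);

/* A btree-like AM in this sketch: backward scans are supported on
 * their own, but not in combination with skipping. */
static bool
btcanbackward(bool skipscan)
{
	return !skipscan;
}
```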
I wasn't planning on making this a patch submission for the July
CommitFest due to the reasons mentioned above, but can do so if people
think it is best. The patch is based on master/4c8156.
Please, register it on the commitfest. Even if there isn't enough
time for this patch in the July commitfest, it's no problem to move it.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 06/18/2018 11:25 AM, Jesper Pedersen wrote:
Hi all,
I would like to start a discussion on Index Skip Scan referred to as
Loose Index Scan in the wiki [1].
awesome
I wasn't planning on making this a patch submission for the July
CommitFest due to the reasons mentioned above, but can do so if people
think it is best.
New large features are not appropriate for the July CF. September should
be your goal.
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 18, 2018 at 11:20 PM Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
On 06/18/2018 11:25 AM, Jesper Pedersen wrote:
I wasn't planning on making this a patch submission for the July
CommitFest due to the reasons mentioned above, but can do so if people
think it is best.
New large features are not appropriate for the July CF. September should
be your goal.
Assuming this, should we have the possibility to register a patch for
the September CF from now?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Jun 19, 2018 at 12:06:59AM +0300, Alexander Korotkov wrote:
Assuming this, should we have possibility to register patch to
September CF from now?
There cannot be two commit fests marked as open at the same time, as
Magnus mentions here:
/messages/by-id/CABUevEx1k+axZcV2t3wEYf5uLg72YbKSch_hUhFnZq+-KSoJsA@mail.gmail.com
There is no issue for admins in registering new patches in future ones,
but normal users cannot, right? In this case, could you wait until the
next CF is marked as in progress and the one for September is opened?
You could also add it to the July one if you are not willing to wait,
and it will get bumped by one of the CFMs, but that makes the whole
process unnecessarily noisy.
--
Michael
On 18 June 2018 at 19:31, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
A couple of questions to begin with.
Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should
a new node (T_IndexSkipScan) be created? If the latter, then there likely
will be functionality that needs to be refactored into shared code
between the nodes.
Is skip scan only possible for index-only scan? I guess not. We
could also make a plain index scan behave like a skip scan. And it
should be useful for accelerating the DISTINCT ON clause. Thus, we might
have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
IndexOnlySkipScan. So, I don't like the idea of index scan nodes
multiplying this way, and it would probably be better to keep the
number of nodes smaller. But I don't insist on that, and I would like
to hear other opinions too.
In one of the patches I'm working on I had a similar situation, where I
wanted to split one node into two similar nodes (before that I had just
extended it), and logically it made perfect sense. But it turned out to be
quite useless and the advantage I got wasn't worth it - and just to mention,
those nodes had more differences than in this patch. So I agree that it
would probably be better to keep using IndexOnlyScan.
On 19 June 2018 at 03:40, Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Jun 19, 2018 at 12:06:59AM +0300, Alexander Korotkov wrote:
Assuming this, should we have the possibility to register a patch for
the September CF from now?
There cannot be two commit fests marked as open at the same time, as
Magnus mentions here:
/messages/by-id/CABUevEx1k+axZcV2t3wEYf5uLg72YbKSch_hUhFnZq+-KSoJsA@mail.gmail.com
In this case, could you wait until the next CF is marked as in progress
and the one for September is opened?
Yep, since the next CF will start shortly that's the easiest thing to do.
Hi Alexander,
On 06/18/2018 01:31 PM, Alexander Korotkov wrote:
<jesper.pedersen@redhat.com> wrote:
Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should
a new node (T_IndexSkipScan) be created? If the latter, then there likely
will be functionality that needs to be refactored into shared code
between the nodes.
Is skip scan only possible for index-only scan? I guess not. We
could also make a plain index scan behave like a skip scan. And it
should be useful for accelerating the DISTINCT ON clause. Thus, we might
have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
IndexOnlySkipScan. So, I don't like the idea of index scan nodes
multiplying this way, and it would probably be better to keep the
number of nodes smaller. But I don't insist on that, and I would like
to hear other opinions too.
Yes, there are likely other use-cases for Index Skip Scan apart from the
simplest form. Which sort of suggests that having dedicated nodes would
be better in the long run.
My goal is to cover the simplest form, which can be handled by extending
the T_IndexOnlyScan node, or by having common functions that both use.
We can always improve the functionality with future patches.
Which is the best way to deal with the amcanbackward functionality? Do
people see another alternative to Robert's idea of adding a flag to the
scan?
Supporting amcanbackward seems to be basically possible, but rather
complicated and not very efficient. So, it seems not worth
implementing, at least in the initial version. And then the question
is how an index access method should report that it supports both
skip scan and backward scan, but not both together. What about turning
amcanbackward into a function which takes a (bool skipscan) argument?
Then this function would return whether backward scan is
supported depending on whether skip scan is used.
The feedback from Robert Haas seems to suggest that it was a requirement
for the patch to be considered.
I wasn't planning on making this a patch submission for the July
CommitFest due to the reasons mentioned above, but can do so if people
think it is best. The patch is based on master/4c8156.
Please, register it on the commitfest. Even if there isn't enough
time for this patch in the July commitfest, it's no problem to move it.
Based on the feedback from Andrew and Michael I won't register this
thread until the September CF.
Thanks for your feedback !
Best regards,
Jesper
Hi Dmitry,
On 06/19/2018 06:01 AM, Dmitry Dolgov wrote:
On 18 June 2018 at 19:31, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Is skip scan only possible for index-only scan? I guess not. We
could also make a plain index scan behave like a skip scan. And it
should be useful for accelerating the DISTINCT ON clause. Thus, we might
have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
IndexOnlySkipScan. So, I don't like the idea of index scan nodes
multiplying this way, and it would probably be better to keep the
number of nodes smaller. But I don't insist on that, and I would like
to hear other opinions too.
In one of the patches I'm working on I had a similar situation, where I
wanted to split one node into two similar nodes (before that I had just
extended it), and logically it made perfect sense. But it turned out to be
quite useless and the advantage I got wasn't worth it - and just to mention,
those nodes had more differences than in this patch. So I agree that it
would probably be better to keep using IndexOnlyScan.
I looked at this today, and creating a new node (T_IndexOnlySkipScan)
would make the patch more complex.
The question is whether the patch should create such a node now, so that
future patches wouldn't have to deal with refactoring to a new node to
cover additional functionality.
Thanks for your feedback !
Best regards,
Jesper
Hello Jesper,
I was reviewing the index-skip patch example and have a comment on it. The example query “select distinct b from t1” is equivalent to “select b from t1 group by b”. When I tried the 2nd form of the query it came up with a different plan; is it possible that index skip scan can address it as well?
postgres=# explain verbose select b from t1 group by b;
QUERY PLAN
----------------------------------------------------------------------------------------------------
Group (cost=97331.29..97332.01 rows=3 width=4)
Output: b
Group Key: t1.b
-> Gather Merge (cost=97331.29..97331.99 rows=6 width=4)
Output: b
Workers Planned: 2
-> Sort (cost=96331.27..96331.27 rows=3 width=4)
Output: b
Sort Key: t1.b
-> Partial HashAggregate (cost=96331.21..96331.24 rows=3 width=4)
Output: b
Group Key: t1.b
-> Parallel Seq Scan on public.t1 (cost=0.00..85914.57 rows=4166657 width=4)
Output: a, b
(14 rows)
Time: 1.167 ms
And here is the original example:
postgres=# explain verbose SELECT DISTINCT b FROM t1;
QUERY PLAN
-------------------------------------------------------------------------------
Index Skip Scan using idx_t1_b on public.t1 (cost=0.43..1.30 rows=3 width=4)
Output: b
(2 rows)
Time: 0.987 ms
On Thu, Aug 16, 2018 at 5:44 PM, Bhushan Uparkar
<bhushan.uparkar@gmail.com> wrote:
I was reviewing the index-skip patch example and have a comment on it. The example query “select distinct b from t1” is equivalent to “select b from t1 group by b”. When I tried the 2nd form of the query it came up with a different plan; is it possible that index skip scan can address it as well?
Yeah, there are a few tricks you can do with "index skip scans"
(Oracle name, or as IBM calls them, "index jump scans"... I was
slightly tempted to suggest we call ours "index hop scans"...). For
example:
* groups and certain aggregates (MIN() and MAX() of suffix index
columns within each group)
* index scans where the scan key doesn't include the leading columns
(but you expect there to be sufficiently few values)
* merge joins (possibly the trickiest and maybe out of range)
You're right that a very simple GROUP BY can be equivalent to a
DISTINCT query, but I'm not sure if it's worth recognising that
directly or trying to implement the more general grouping trick that
can handle MIN/MAX, and whether that should be the same executor
node... The idea of starting with DISTINCT was just that it's
comparatively easy. We should certainly try to look ahead and bear
those features in mind when figuring out the interfaces though. Would
the indexam skip(scan, direction, prefix_size) operation I proposed be
sufficient? Is there a better way?
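As a toy model of what such a skip operation buys in the DISTINCT case, here is a self-contained C sketch in which a sorted array stands in for the index; a real implementation would descend the btree rather than binary-search an array, and all names here are invented, not PostgreSQL API:

```c
/* Toy model of the proposed skip(scan, direction, prefix) operation:
 * a sorted int array stands in for a btree index on one column, and
 * a "skip" is a binary search past the current value instead of
 * walking every entry. */
#include <stddef.h>

/* First position whose key is greater than 'key': one O(log n)
 * "skip" per distinct value, like a fresh btree descent. */
static size_t
skip_past(const int *index, size_t nentries, int key)
{
	size_t		lo = 0,
				hi = nentries;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (index[mid] <= key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* SELECT DISTINCT over the "index": emit each value once, then skip.
 * Returns the number of distinct values written to 'out'. */
static size_t
distinct_scan(const int *index, size_t nentries, int *out)
{
	size_t		pos = 0,
				ndistinct = 0;

	while (pos < nentries)
	{
		out[ndistinct++] = index[pos];
		pos = skip_past(index, nentries, index[pos]);
	}
	return ndistinct;
}
```

For the 10M-row, 3-value example upthread this amounts to a handful of descents instead of reading 10M entries, which is roughly where the "Buffers: shared hit=13" comes from.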
I'm glad to see this topic come back!
--
Thomas Munro
http://www.enterprisedb.com
Hi Bhushan,
On 08/16/2018 01:44 AM, Bhushan Uparkar wrote:
I was reviewing the index-skip patch example and have a comment on it.
Thanks for your interest, and feedback on this patch !
The example query “select distinct b from t1” is equivalent to “select b from t1 group by b”. When I tried the 2nd form of the query it came up with a different plan; is it possible that index skip scan can address it as well?
Like Thomas commented down-thread my goal is to keep this contribution
as simple as possible in order to get to something that can be
committed. Improvements can follow in future CommitFests, which may end
up in the same release.
However, as stated in my original mail my goal is the simplest form of
Index Skip Scan (or whatever we call it). I welcome any help with the patch.
Best regards,
Jesper
Hi Thomas,
On 08/16/2018 02:22 AM, Thomas Munro wrote:
The idea of starting with DISTINCT was just that it's
comparatively easy. We should certainly try to look ahead and bear
those features in mind when figuring out the interfaces though. Would
the indexam skip(scan, direction, prefix_size) operation I proposed be
sufficient? Is there a better way?
Yeah, I'm hoping that a Committer can provide some feedback on the
direction that this patch needs to take.
One thing to consider is the pluggable storage patch, which is a lot
more important than this patch. I don't want this patch to get in the
way of that work, so it may have to wait a bit in order to see any new
potential requirements.
I'm glad to see this topic come back!
You did the work, and yes hopefully we can get closer to this subject in
12 :)
Best regards,
Jesper
Greetings,
* Jesper Pedersen (jesper.pedersen@redhat.com) wrote:
On 08/16/2018 02:22 AM, Thomas Munro wrote:
The idea of starting with DISTINCT was just that it's
comparatively easy. We should certainly try to look ahead and bear
those features in mind when figuring out the interfaces though. Would
the indexam skip(scan, direction, prefix_size) operation I proposed be
sufficient? Is there a better way?
Yeah, I'm hoping that a Committer can provide some feedback on the direction
that this patch needs to take.
Thomas is one these days. :)
At least on first glance, that indexam seems to make sense to me, but
I've not spent a lot of time thinking about it. Might be interesting to
ask Peter G about it though.
One thing to consider is the pluggable storage patch, which is a lot more
important than this patch. I don't want this patch to get in the way of that
work, so it may have to wait a bit in order to see any new potential
requirements.
Not sure where this came from, but I don't think it's particularly good
to be suggesting that one feature is more important than another or that
we need to have one wait for another as this seems to imply. I'd
certainly really like to see PG finally have skipping scans, for one
thing, and it seems like with some effort that might be able to happen
for v12. I'm not convinced that we're going to get pluggable storage to
happen in v12 and I don't really agree with recommending that people
hold off on making improvements to things because it's coming.
Thanks!
Stephen
Hi Stephen,
On 08/16/2018 02:36 PM, Stephen Frost wrote:
Yeah, I'm hoping that a Committer can provide some feedback on the direction
that this patch needs to take.
Thomas is one these days. :)
I know :) However, there are some open questions from Thomas' original
submission that still need to be ironed out.
At least on first glance, that indexam seems to make sense to me, but
I've not spent a lot of time thinking about it. Might be interesting to
ask Peter G about it though.
Yes, or Anastasia, who has also done a lot of work on nbtree.
One thing to consider is the pluggable storage patch, which is a lot more
important than this patch. I don't want this patch to get in the way of that
work, so it may have to wait a bit in order to see any new potential
requirements.
Not sure where this came from, but I don't think it's particularly good
to be suggesting that one feature is more important than another or that
we need to have one wait for another as this seems to imply. I'd
certainly really like to see PG finally have skipping scans, for one
thing, and it seems like with some effort that might be able to happen
for v12. I'm not convinced that we're going to get pluggable storage to
happen in v12 and I don't really agree with recommending that people
hold off on making improvements to things because it's coming.
My point was that I know this patch needs work, so any feedback that gets
it closer to a solution will help.
Pluggable storage may or may not add new requirements, but it is up to
the people working on that, some of whom are Committers, to take time
"off" to provide feedback for this patch in order to steer me in the
right direction.
Work can happen in parallel, and I'm definitely not recommending that
people hold off on any patches that they want to provide feedback for,
or submit for a CommitFest.
Best regards,
Jesper
Greetings,
* Jesper Pedersen (jesper.pedersen@redhat.com) wrote:
On 08/16/2018 02:36 PM, Stephen Frost wrote:
Not sure where this came from, but I don't think it's particularly good
to be suggesting that one feature is more important than another or that
we need to have one wait for another as this seems to imply. I'd
certainly really like to see PG finally have skipping scans, for one
thing, and it seems like with some effort that might be able to happen
for v12. I'm not convinced that we're going to get pluggable storage to
happen in v12 and I don't really agree with recommending that people
hold off on making improvements to things because it's coming.
My point was that I know this patch needs work, so any feedback that gets it
closer to a solution will help.
Pluggable storage may or may not add new requirements, but it is up to the
people working on that, some of whom are Committers, to take time "off" to
provide feedback for this patch in order to steer me in the right direction.
I don't think it's really necessary for this work to be suffering under
some concern that pluggable storage will make it have to change. Sure,
it might, but it also very well might not. For my 2c, anyway, this
seems likely to get into the tree before pluggable storage does and it's
pretty unlikely to be the only thing that that work will need to be
prepared to address when it happens.
Work can happen in parallel, and I'm definitely not recommending that people
hold off on any patches that they want to provide feedback for, or submit
for a CommitFest.
Yes, work can happen in parallel, and I don't really think there needs
to be a concern about some other patch set when it comes to getting this
patch committed.
Thanks!
Stephen
On Wed, Aug 15, 2018 at 11:22 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Yeah, there are a few tricks you can do with "index skip scans"
(Oracle name, or as IBM calls them, "index jump scans"... I was
slightly tempted to suggest we call ours "index hop scans"...).
Hopscotch scans?
* groups and certain aggregates (MIN() and MAX() of suffix index
columns within each group)
* index scans where the scan key doesn't include the leading columns
(but you expect there to be sufficiently few values)
* merge joins (possibly the trickiest and maybe out of range)
FWIW, I suspect that we're going to have the biggest problems in the
optimizer. It's not as if ndistinct is in any way reliable. That may
matter more on average than it has with other path types.
--
Peter Geoghegan
On August 16, 2018 8:28:45 PM GMT+02:00, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
One thing to consider is the pluggable storage patch, which is a lot
more important than this patch. I don't want this patch to get in the
way of that work, so it may have to wait a bit in order to see any new
potential requirements.
I don't think there would be a meaningful amount of conflict between the two, relative to the size of the patch sets. So I don't think we have to consider relative importance (which I don't think is that easy to assess in this case).
Fwiw, I've a significantly further revised version of the tableam patch that I plan to send in a few days. Ported the current zheap patch as a separate AM which helped weed out a lot of issues.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Fri, Aug 17, 2018 at 7:48 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Aug 15, 2018 at 11:22 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
* groups and certain aggregates (MIN() and MAX() of suffix index
columns within each group)
* index scans where the scan key doesn't include the leading columns
(but you expect there to be sufficiently few values)
* merge joins (possibly the trickiest and maybe out of range)
FWIW, I suspect that we're going to have the biggest problems in the
optimizer. It's not as if ndistinct is in any way reliable. That may
matter more on average than it has with other path types.
Can you give an example of problematic ndistinct underestimation?
I suppose you might be able to defend against that in the executor: if
you find that you've done an unexpectedly high number of skips, you
could fall back to regular next-tuple mode. Unfortunately that'd
require the parent plan node to tolerate non-unique results.
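Roughly like this, as a self-contained sketch (all names invented; linear stepping stands in for real index access):

```c
/* Sketch of the fallback idea above: skip while the number of skips
 * stays within a budget derived from the planner's ndistinct estimate,
 * then degrade to plain next-tuple stepping. Illustrative only. */
#include <stdbool.h>
#include <stddef.h>

typedef struct SkipState
{
	size_t		skips_done;		/* skips performed so far */
	size_t		skip_budget;	/* e.g. ndistinct estimate times slack */
	bool		fallback;		/* true once the estimate stops being trusted */
} SkipState;

/* Advance past 'key': skip while within budget, otherwise step one
 * entry at a time (regular next-tuple mode). In fallback mode the
 * caller sees duplicates, i.e. the parent must tolerate non-unique
 * results. Returns the next position to read. */
static size_t
advance(SkipState *st, const int *index, size_t nentries,
		size_t pos, int key)
{
	if (!st->fallback && st->skips_done >= st->skip_budget)
		st->fallback = true;

	if (st->fallback)
		return pos + 1;

	st->skips_done++;
	while (pos < nentries && index[pos] <= key)
		pos++;					/* linear here for brevity; a real skip
								 * would search the btree */
	return pos;
}
```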
I noticed that the current patch doesn't care about restrictions on
the range (SELECT DISTINCT a FROM t WHERE a BETWEEN 500 and 600), but
that causes it to overestimate the number of btree searches, which is
a less serious problem (it might not choose a skip scan when it would
have been better).
--
Thomas Munro
http://www.enterprisedb.com
Hi Peter,
On 08/16/2018 03:48 PM, Peter Geoghegan wrote:
On Wed, Aug 15, 2018 at 11:22 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
* groups and certain aggregates (MIN() and MAX() of suffix index
columns within each group)
* index scans where the scan key doesn't include the leading columns
(but you expect there to be sufficiently few values)
* merge joins (possibly the trickiest and maybe out of range)
FWIW, I suspect that we're going to have the biggest problems in the
optimizer. It's not as if ndistinct is in any way reliable. That may
matter more on average than it has with other path types.
Thanks for sharing this; it is very useful to know.
Best regards,
Jesper
On Thu, Aug 16, 2018 at 4:10 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Can you give an example of problematic ndistinct underestimation?
Yes. See /messages/by-id/CAKuK5J12QokFh88tQz-oJMSiBg2QyjM7K7HLnbYi3Ze+Y5BtWQ@mail.gmail.com,
for example. That's a complaint about an underestimation specifically.
This seems to come up about once every 3 years, at least from my
perspective. I'm always surprised that ndistinct doesn't get
implicated in bad query plans more frequently.
I suppose you might be able to defend against that in the executor: if
you find that you've done an unexpectedly high number of skips, you
could fall back to regular next-tuple mode. Unfortunately that'd
require the parent plan node to tolerate non-unique results.
I like the idea of dynamic fallback in certain situations, but the
details always seem complicated.
--
Peter Geoghegan
On Mon, 18 Jun 2018 at 17:26, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
I took Thomas Munro's previous patch [2] on the subject, added a GUC, a
test case, documentation hooks, minor code cleanups, and made the patch
pass an --enable-cassert make check-world run. So, the overall design is
the same.
I've looked through the patch more closely, and have a few questions:
* Is there any reason why only the copy function for the IndexOnlyScan node
includes an implementation for distinctPrefix? Without read/out functionality,
skip doesn't work for parallel scans, so it looks like this:
=# SET force_parallel_mode TO ON;
=# EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT DISTINCT b FROM t1;
QUERY PLAN
-------------------------------------------------------------------------------
Gather (cost=1000.43..1001.60 rows=3 width=4)
(actual time=11.054..17672.010 rows=10000000 loops=1)
Output: b
Workers Planned: 1
Workers Launched: 1
Single Copy: true
Buffers: shared hit=91035 read=167369
-> Index Skip Scan using idx_t1_b on public.t1
(cost=0.43..1.30 rows=3 width=4)
(actual time=1.350..16065.165 rows=10000000 loops=1)
Output: b
Heap Fetches: 10000000
Buffers: shared hit=91035 read=167369
Worker 0: actual time=1.350..16065.165 rows=10000000 loops=1
Buffers: shared hit=91035 read=167369
Planning Time: 0.394 ms
Execution Time: 6037.800 ms
and with this functionality it gets better:
=# SET force_parallel_mode TO ON;
=# EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT DISTINCT b FROM t1;
QUERY PLAN
-------------------------------------------------------------------------------
Gather (cost=1000.43..1001.60 rows=3 width=4)
(actual time=3.564..4.427 rows=3 loops=1)
Output: b
Workers Planned: 1
Workers Launched: 1
Single Copy: true
Buffers: shared hit=4 read=10
-> Index Skip Scan using idx_t1_b on public.t1
(cost=0.43..1.30 rows=3 width=4)
(actual time=0.065..0.133 rows=3 loops=1)
Output: b
Heap Fetches: 3
Buffers: shared hit=4 read=10
Worker 0: actual time=0.065..0.133 rows=3 loops=1
Buffers: shared hit=4 read=10
Planning Time: 1.724 ms
Execution Time: 4.522 ms
* What is the purpose of HeapFetches? I don't see any usage of this variable
except assigning 0 to it once.
Hi Dmitry,
On 9/10/18 5:47 PM, Dmitry Dolgov wrote:
On Mon, 18 Jun 2018 at 17:26, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
I've looked through the patch more closely, and have a few questions:
Thanks for your review !
* Is there any reason why only copy function for the IndexOnlyScan node
includes an implementation for distinctPrefix?
Oversight -- thanks for catching that.
* What is the purpose of HeapFetches? I don't see any usage of this variable
except assigning 0 to it once.
That can be removed.
New version WIP v2 attached.
Best regards,
Jesper
Attachments:
wip_indexskipscan_v2.patch (text/x-patch)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6b2b9e3742..74ed15bfeb 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bee4afbe4e..fd06549491 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3725,6 +3725,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index beb99d1831..ccbb44288d 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -665,6 +666,14 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ TODO
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index a57c5e2e1f..842db029fa 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1312,6 +1312,22 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
such cases and allow index-only scans to be generated, but older versions
will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ TODO
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e95fbbcea7..85d6571c6d 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -106,6 +106,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 0a32182dd7..162639090d 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..ecd4af49d8 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0002df30c0..7120950868 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 22b5cc921f..f9451768c8 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -791,6 +792,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_amroutine->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_amroutine->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e8725fbbe1..3d02a96dad 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -130,6 +130,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -378,6 +379,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -445,6 +448,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d3700bd082..52c24c7541 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1193,6 +1193,121 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /* Release the current associated buffer */
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(scan->indexRelation, scan->xs_itup);
+ }
+ else
+ {
+ Relation rel;
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ rel = scan->indexRelation;
+ itupdesc = RelationGetDescr(rel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) | (rel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ so->skipScanKey[i].sk_flags = flags;
+ so->skipScanKey[i].sk_argument = datum;
+ }
+ }
+
+ /* Use _bt_search and _bt_binsrch to get the buffer and offset number */
+ stack = _bt_search(scan->indexRelation, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir), &buf, BT_READ,
+ scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ if (ScanDirectionIsForward(dir))
+ {
+ so->currPos.moreLeft = false;
+ so->currPos.moreRight = true;
+
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+ else
+ {
+ so->currPos.moreLeft = true;
+ so->currPos.moreRight = false;
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ _bt_freeskey(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 6d59b316ae..157b008284 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -59,6 +59,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 16a80a0ea1..7e038cd9a3 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -993,7 +993,13 @@ ExplainNode(PlanState *planstate, List *ancestors,
pname = sname = "Index Scan";
break;
case T_IndexOnlyScan:
- pname = sname = "Index Only Scan";
+ {
+ IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->distinctPrefix > 0)
+ pname = sname = "Index Skip Scan";
+ else
+ pname = sname = "Index Only Scan";
+ }
break;
case T_BitmapIndexScan:
pname = sname = "Bitmap Index Scan";
@@ -1222,6 +1228,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->distinctPrefix > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->distinctPrefix,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 8c32a74d39..e69157741f 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -112,6 +112,19 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
+ {
+ /* Reached end of index. */
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -247,6 +260,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -509,6 +524,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_NumDistinctKeys = node->distinctPrefix;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 7c8220cf65..012f61d1ad 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -517,6 +517,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(distinctPrefix);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b5af904c18..6892d94dcf 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -582,6 +582,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(distinctPrefix);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 3254524223..ab118c4b57 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1780,6 +1780,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(distinctPrefix);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7bf67a0529..b4c4edd276 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -122,6 +122,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ae41c9efa0..9569f45745 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -175,7 +175,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2706,7 +2707,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -5115,7 +5117,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipprefix)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5130,6 +5133,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->distinctPrefix = skipprefix;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 96bf0601a8..0c3ace4902 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4720,6 +4720,17 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath) &&
+ path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ ((IndexPath *) path)->indexinfo->amcanskip)
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root, distinct_rel,
+ path,
+ list_length(root->distinct_pathkeys),
+ numDistinctRows));
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index c5aaaf5c22..b9e7baa5d4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2768,6 +2768,45 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(subpath, IndexPath));
+
+ /* We don't want to modify subpath, so make a copy. */
+ memcpy(pathnode, subpath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(numCols > 0);
+ pathnode->indexskipprefix = numCols;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = subpath->startup_cost;
+ pathnode->path.total_cost = subpath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 8369e3ad62..f07aba1c9c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -270,6 +270,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0625eff219..04eeb984cd 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -852,6 +852,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 7486d20a34..3637133d18 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -297,6 +297,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 14526a6bb2..81e1ea5d5f 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -127,6 +127,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -221,6 +225,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 24c720bf42..4c260115b4 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -170,6 +170,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 04ecb4cbc0..6009edb22d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -471,6 +471,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ ScanKey skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -571,6 +574,7 @@ extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -598,6 +602,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index c830f141b1..9e9bee0beb 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1323,6 +1323,8 @@ typedef struct IndexScanState
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
* VMBuffer buffer in use for visibility map testing, if any
+ * NumDistinctKeys number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ioss_PscanLen Size of parallel index-only scan descriptor
* ----------------
*/
@@ -1341,6 +1343,8 @@ typedef struct IndexOnlyScanState
Relation ioss_RelationDesc;
IndexScanDesc ioss_ScanDesc;
Buffer ioss_VMBuffer;
+ int ioss_NumDistinctKeys;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 7c2abbd03a..1e572853d8 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -438,6 +438,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int distinctPrefix; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index adb4265047..27adafd6a6 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -811,6 +811,7 @@ typedef struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1161,6 +1162,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1175,6 +1179,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 77ca7ff837..1fb3de6fa6 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -58,6 +58,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 7c5ff22650..349b062ee2 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -186,6 +186,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index be25101db2..01bae00fdb 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..f443ef4623 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,21 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------
+ Index Skip Scan using tenk1_four on public.tenk1
+ Output: four
+(2 rows)
+
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index f9e7118f0d..3a7d59243c 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..db222fa510 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,9 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
--
2.17.1
Hi Jesper,
While testing this patch I noticed that the current implementation doesn't
perform well when we have lots of small groups of equal values. Here is
the execution time of index skip scan vs. unique over index scan, in ms,
depending on the size of the group. The benchmark script is attached.
group size    skip       unique
         1    2,293.85   132.55
         5      464.40   106.59
        10      239.61   102.02
        50       56.59    98.74
       100       32.56   103.04
       500        6.08    97.09
So, the current implementation can lead to a performance regression, and
the choice of the plan depends on the notoriously unreliable ndistinct
statistics. The regression is probably because the skip scan always does
_bt_search to find the next unique tuple. I think we can improve this,
and the skip scan can be strictly faster than an index scan regardless of
the data. As a first approximation, imagine that we somehow skipped
equal tuples inside _bt_next instead of sending them to the parent
Unique node. This would already be marginally faster than Unique + Index
scan. A more practical implementation would be to remember our position
in the tree (that is, the BTStack returned by _bt_search) and use it to
skip pages in bulk. This looks straightforward to implement for a tree
that does not change, but I'm not sure how to make it work with
concurrent modifications. Still, this looks like a worthwhile direction
to me, because if we have a strictly faster skip scan, we can always use
it and not worry about our unreliable statistics. What do you think?
--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
Hi Alexander.
On 9/13/18 9:01 AM, Alexander Kuzmenkov wrote:
While testing this patch
Thanks for the review !
I noticed that current implementation doesn't
perform well when we have lots of small groups of equal values. Here is
the execution time of index skip scan vs unique over index scan, in ms,
depending on the size of the group. The benchmark script is attached.

group size    skip       unique
         1    2,293.85   132.55
         5      464.40   106.59
        10      239.61   102.02
        50       56.59    98.74
       100       32.56   103.04
       500        6.08    97.09
Yes, this doesn't look good. Using your test case I'm seeing that unique
is being chosen when the group size is below 34, and skip above. This is
with the standard initdb configuration; did you change something else ?
Or did you force the default plan ?
So, the current implementation can lead to performance regression, and
the choice of the plan depends on the notoriously unreliable ndistinct
statistics.
Yes, Peter mentioned this, which I'm still looking at.
The regression is probably because skip scan always does
_bt_search to find the next unique tuple.
Very likely.
I think we can improve this,
and the skip scan can be strictly faster than index scan regardless of
the data. As a first approximation, imagine that we somehow skipped
equal tuples inside _bt_next instead of sending them to the parent
Unique node. This would already be marginally faster than Unique + Index
scan. A more practical implementation would be to remember our position
in tree (that is, BTStack returned by _bt_search) and use it to skip
pages in bulk. This looks straightforward to implement for a tree that
does not change, but I'm not sure how to make it work with concurrent
modifications. Still, this looks a worthwhile direction to me, because
if we have a strictly faster skip scan, we can just use it always and
not worry about our unreliable statistics. What do you think?
This is something to look at -- maybe there is a way to use
btpo_next/btpo_prev instead/too in order to speed things up. Atm we just
have the scan key in BTScanOpaqueData. I'll take a look after my
upcoming vacation; feel free to contribute those changes in the meantime
of course.
Thanks again !
Best regards,
Jesper
On 13/09/18 at 18:39, Jesper Pedersen wrote:
Yes, this doesn't look good. Using your test case I'm seeing that
unique is being chosen when the group size is below 34, and skip
above. This is with the standard initdb configuration; did you change
something else ? Or did you force the default plan ?
Sorry I didn't mention this, the first column is indeed forced skip
scan, just to see how it compares to index scan.
This is something to look at -- maybe there is a way to use
btpo_next/btpo_prev instead/too in order to speed things up. Atm we
just have the scan key in BTScanOpaqueData. I'll take a look after my
upcoming vacation; feel free to contribute those changes in the
meantime of course.
I probably won't be able to contribute the changes, but I'll try to
review them.
--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, 13 Sep 2018 at 21:36, Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> wrote:
On 13/09/18 at 18:39, Jesper Pedersen wrote:
I think we can improve this,
and the skip scan can be strictly faster than index scan regardless of
the data. As a first approximation, imagine that we somehow skipped
equal tuples inside _bt_next instead of sending them to the parent
Unique node. This would already be marginally faster than Unique + Index
scan. A more practical implementation would be to remember our position
in tree (that is, BTStack returned by _bt_search) and use it to skip
pages in bulk. This looks straightforward to implement for a tree that
does not change, but I'm not sure how to make it work with concurrent
modifications. Still, this looks a worthwhile direction to me, because
if we have a strictly faster skip scan, we can just use it always and
not worry about our unreliable statistics. What do you think?
This is something to look at -- maybe there is a way to use
btpo_next/btpo_prev instead/too in order to speed things up. Atm we just
have the scan key in BTScanOpaqueData. I'll take a look after my
upcoming vacation; feel free to contribute those changes in the meantime
of course.
But having this logic inside _bt_next means that it will make a non-skip index
only scan a bit slower, am I right? In that case it would probably be easier
and more straightforward to go with the idea of a dynamic fallback. The first
naive implementation I came up with is to wrap the index scan node in a Unique
node and remember the estimated number of groups in IndexOnlyScanState, so
that we can check whether we have performed many more skips than expected.
With this approach an index skip scan will be a bit slower than in the
original patch when ndistinct is correct (because the Unique node will recheck
the rows we returned), but will fall back to Unique + index only scan when the
planner has underestimated ndistinct.
Attachment: index-skip-fallback.patch (application/octet-stream)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6b2b9e3742..74ed15bfeb 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bee4afbe4e..fd06549491 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3725,6 +3725,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index beb99d1831..ccbb44288d 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -665,6 +666,14 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ TODO
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index a57c5e2e1f..842db029fa 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1312,6 +1312,22 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
such cases and allow index-only scans to be generated, but older versions
will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ TODO
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e95fbbcea7..85d6571c6d 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -106,6 +106,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 0a32182dd7..162639090d 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..ecd4af49d8 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0002df30c0..7120950868 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 22b5cc921f..f9451768c8 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -791,6 +792,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_amroutine->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_amroutine->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e8725fbbe1..3d02a96dad 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -130,6 +130,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -378,6 +379,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -445,6 +448,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d3700bd082..52c24c7541 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1193,6 +1193,121 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /* Release the current associated buffer */
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(scan->indexRelation, scan->xs_itup);
+ }
+ else
+ {
+ Relation rel;
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ rel = scan->indexRelation;
+ itupdesc = RelationGetDescr(rel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) | (rel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ so->skipScanKey[i].sk_flags = flags;
+ so->skipScanKey[i].sk_argument = datum;
+ }
+ }
+
+ /* Use _bt_search and _bt_binsrch to get the buffer and offset number */
+ stack = _bt_search(scan->indexRelation, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir), &buf, BT_READ,
+ scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ if (ScanDirectionIsForward(dir))
+ {
+ so->currPos.moreLeft = false;
+ so->currPos.moreRight = true;
+
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+ else
+ {
+ so->currPos.moreLeft = true;
+ so->currPos.moreRight = false;
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ _bt_freeskey(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 6d59b316ae..157b008284 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -59,6 +59,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 16a80a0ea1..7e038cd9a3 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -993,7 +993,13 @@ ExplainNode(PlanState *planstate, List *ancestors,
pname = sname = "Index Scan";
break;
case T_IndexOnlyScan:
- pname = sname = "Index Only Scan";
+ {
+ IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->distinctPrefix > 0)
+ pname = sname = "Index Skip Scan";
+ else
+ pname = sname = "Index Only Scan";
+ }
break;
case T_BitmapIndexScan:
pname = sname = "Bitmap Index Scan";
@@ -1222,6 +1228,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->distinctPrefix > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->distinctPrefix,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 8c32a74d39..b2acc22ef9 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -112,6 +112,21 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted
+ && node->ioss_NumOfSkips < 10 * node->ioss_PlanRows)
+ {
+ node->ioss_NumOfSkips += 1;
+ if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
+ {
+ /* Reached end of index. */
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -247,6 +262,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -509,6 +526,10 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_NumDistinctKeys = node->distinctPrefix;
+ indexstate->ioss_FirstTupleEmitted = false;
+ indexstate->ioss_NumOfSkips = 0;
+ indexstate->ioss_PlanRows = node->scan.plan.plan_rows;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 7c8220cf65..012f61d1ad 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -517,6 +517,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(distinctPrefix);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b5af904c18..6892d94dcf 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -582,6 +582,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(distinctPrefix);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 3254524223..ab118c4b57 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1780,6 +1780,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(distinctPrefix);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7bf67a0529..b4c4edd276 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -122,6 +122,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ae41c9efa0..9569f45745 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -175,7 +175,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2706,7 +2707,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -5115,7 +5117,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipprefix)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5130,6 +5133,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->distinctPrefix = skipprefix;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 96bf0601a8..7efc6f0fe7 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4720,6 +4720,25 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath) &&
+ path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ ((IndexPath *) path)->indexinfo->amcanskip)
+ {
+ Path *subpath = (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ list_length(root->distinct_pathkeys),
+ numDistinctRows);
+ add_path(distinct_rel, (Path *)
+ create_upper_unique_path(root, distinct_rel,
+ subpath,
+ list_length(root->distinct_pathkeys),
+ numDistinctRows));
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index c5aaaf5c22..b9e7baa5d4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2768,6 +2768,45 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(subpath, IndexPath));
+
+ /* We don't want to modify subpath, so make a copy. */
+ memcpy(pathnode, subpath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(numCols > 0);
+ pathnode->indexskipprefix = numCols;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = subpath->startup_cost;
+ pathnode->path.total_cost = subpath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 8369e3ad62..f07aba1c9c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -270,6 +270,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0625eff219..04eeb984cd 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -852,6 +852,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 7486d20a34..3637133d18 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -297,6 +297,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 14526a6bb2..81e1ea5d5f 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -127,6 +127,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -221,6 +225,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 24c720bf42..4c260115b4 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -170,6 +170,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 04ecb4cbc0..6009edb22d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -471,6 +471,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ ScanKey skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -571,6 +574,7 @@ extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -598,6 +602,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index c830f141b1..ffde4f5c98 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1323,6 +1323,8 @@ typedef struct IndexScanState
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
* VMBuffer buffer in use for visibility map testing, if any
+ * NumDistinctKeys number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ioss_PscanLen Size of parallel index-only scan descriptor
* ----------------
*/
@@ -1341,7 +1343,11 @@ typedef struct IndexOnlyScanState
Relation ioss_RelationDesc;
IndexScanDesc ioss_ScanDesc;
Buffer ioss_VMBuffer;
+ int ioss_NumDistinctKeys;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
+ int ioss_PlanRows;
+ int ioss_NumOfSkips;
} IndexOnlyScanState;
/* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 7c2abbd03a..1e572853d8 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -438,6 +438,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int distinctPrefix; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index adb4265047..27adafd6a6 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -811,6 +811,7 @@ typedef struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1161,6 +1162,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1175,6 +1179,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 77ca7ff837..1fb3de6fa6 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -58,6 +58,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 7c5ff22650..349b062ee2 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -186,6 +186,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index be25101db2..01bae00fdb 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..8c44417ace 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,23 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------------
+ Unique
+ Output: four
+ -> Index Skip Scan using tenk1_four on public.tenk1
+ Output: four
+(4 rows)
+
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index f9e7118f0d..3a7d59243c 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..db222fa510 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,9 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
Hi Dmitry,
On 9/15/18 3:52 PM, Dmitry Dolgov wrote:
On Thu, 13 Sep 2018 at 21:36, Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> wrote:
On 13/09/18 at 18:39, Jesper Pedersen wrote:
I think we can improve this,
and the skip scan can be strictly faster than index scan regardless of
the data. As a first approximation, imagine that we somehow skipped
equal tuples inside _bt_next instead of sending them to the parent
Unique node. This would already be marginally faster than Unique + Index
scan. A more practical implementation would be to remember our position
in tree (that is, BTStack returned by _bt_search) and use it to skip
pages in bulk. This looks straightforward to implement for a tree that
does not change, but I'm not sure how to make it work with concurrent
modifications. Still, this looks a worthwhile direction to me, because
if we have a strictly faster skip scan, we can just use it always and
not worry about our unreliable statistics. What do you think?
This is something to look at -- maybe there is a way to use
btpo_next/btpo_prev instead/as well in order to speed things up. At the
moment we just have the scan key in BTScanOpaqueData. I'll take a look
after my upcoming vacation; feel free to contribute those changes in the
meantime, of course.
But having this logic inside _bt_next means that it will make a non-skip index
only scan a bit slower, am I right?
Correct.
Probably it would be easier and more
straightforward to go with the idea of dynamic fallback then. The first naive
implementation that I came up with is to wrap an index scan node into a unique,
and remember estimated number of groups into IndexOnlyScanState, so that we can
check whether we performed many more skips than expected. With this approach an
index skip scan will be a bit slower than in the original patch when
ndistinct is correct (because the Unique node will recheck the rows we return),
but it will fall back to Unique + index-only scan when the planner has
underestimated ndistinct.
I think we need a comment on this in the patch, as 10 *
node->ioss_PlanRows looks a bit random.
Thanks for your contribution !
Best regards,
Jesper
On Thu, 27 Sep 2018 at 15:59, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
I think we need a comment on this in the patch, as 10 *
node->ioss_PlanRows looks a bit random.
Yeah, you're right, it's a somewhat arbitrary number - we just need to make sure
that this estimate is not too small (to avoid false positives), but also not
too big (so we don't miss the proper point for a fallback). I left it
uncommented mostly because I wanted to get some feedback on it first, and
probably some suggestions about how to make this estimate better.
Hi
I tested the last patch and I have some notes:
1.
postgres=# explain select distinct a10000 from foo;
+-------------------------------------------------------------------------------------------+
| QUERY PLAN |
+-------------------------------------------------------------------------------------------+
| Unique (cost=0.43..4367.56 rows=9983 width=4) |
| -> Index Skip Scan using foo_a10000_idx on foo (cost=0.43..4342.60 rows=9983 width=4) |
+-------------------------------------------------------------------------------------------+
(2 rows)
In this case the Unique node is useless and can be removed.
2. It would be nice to have COUNT(DISTINCT) support, similar to the MIN/MAX support.
3. The patched postgres crashed once, but I am not able to reproduce it.
Looks like a very interesting patch, and important for some BI platforms.
Hi Pavel,
On 10/9/18 9:42 AM, Pavel Stehule wrote:
I tested last patch and I have some notes:
1.
postgres=# explain select distinct a10000 from foo;
+-------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                 |
+-------------------------------------------------------------------------------------------+
| Unique  (cost=0.43..4367.56 rows=9983 width=4)                                             |
|   ->  Index Skip Scan using foo_a10000_idx on foo  (cost=0.43..4342.60 rows=9983 width=4)  |
+-------------------------------------------------------------------------------------------+
(2 rows)
In this case the Unique node is useless and can be removed.
2. It would be nice to have COUNT(DISTINCT) support, similar to the MIN/MAX support.
3. The patched postgres crashed once, but I am not able to reproduce it.
Please, send that query through if you can replicate it. The patch
currently passes an assert'ed check-world, so your query clearly
triggered something that isn't covered yet.
Looks like a very interesting patch, and important for some BI platforms.
Thanks for your review !
Best regards,
Jesper
On Tue, 9 Oct 2018 at 15:43, Pavel Stehule <pavel.stehule@gmail.com> wrote:
Hi
I tested last patch and I have some notes:
1.
postgres=# explain select distinct a10000 from foo;
+-------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                 |
+-------------------------------------------------------------------------------------------+
| Unique  (cost=0.43..4367.56 rows=9983 width=4)                                             |
|   ->  Index Skip Scan using foo_a10000_idx on foo  (cost=0.43..4342.60 rows=9983 width=4)  |
+-------------------------------------------------------------------------------------------+
(2 rows)
In this case the Unique node is useless and can be removed.
Just to clarify, which version exactly were you testing? If it was
index-skip-fallback.patch,
then the Unique node was added there to address the situation when
ndistinct is underestimated, with the idea to fall back to the original plan
(and to tolerate that, I suggested using Unique, since we don't know
during planning whether the fallback will happen or not).
2. It would be nice to have COUNT(DISTINCT) support, similar to the MIN/MAX support.
Yep, as far as I understand MIN/MAX is going to be the next step after this
patch is accepted.
3. The patched postgres crashed once, but I am not able to reproduce it.
Maybe you have at least some ideas about what could cause it, or a way you
tried to reproduce it that doesn't work anymore?
On Tue, 9 Oct 2018 at 15:59, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
On Tue, 9 Oct 2018 at 15:43, Pavel Stehule <pavel.stehule@gmail.com>
wrote:
Hi
I tested last patch and I have some notes:
1.
postgres=# explain select distinct a10000 from foo;
+-------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                 |
+-------------------------------------------------------------------------------------------+
| Unique  (cost=0.43..4367.56 rows=9983 width=4)                                             |
|   ->  Index Skip Scan using foo_a10000_idx on foo  (cost=0.43..4342.60 rows=9983 width=4)  |
+-------------------------------------------------------------------------------------------+
(2 rows)
In this case Unique node is useless and can be removed
Just to clarify which exactly version were you testing? If
index-skip-fallback.patch,
then the Unique node was added there to address the situation when
ndistinct is underestimated, with an idea to fallback to original plan
(and to tolerate that I suggested to use Unique, since we don't know
if fallback will happen or not during the planning).
I tested index-skip-fallback.patch.
It looks like a good idea, but then the node should be named "index scan" and
the other info can be displayed in the detail parts, similar to "sort".
The combination of Unique and Index Skip Scan looks strange :)
2. It would be nice to have COUNT(DISTINCT) support, similar to the MIN/MAX support.
Yep, as far as I understand MIN/MAX is going to be the next step after this
patch is accepted.
ok
Now that the development cycle is starting - maybe it can use the same
infrastructure as MIN/MAX, and that part can be short.
Even more so if you use a dynamic index scan.
3. The patched postgres crashed once, but I am not able to reproduce it.
Maybe you have at least some ideas about what could cause it, or a way to
reproduce it that doesn't work anymore?
I think it was a query like
select count(*) from (select distinct x from tab) s
On Tue, 9 Oct 2018 at 16:13, Pavel Stehule <pavel.stehule@gmail.com> wrote:
On Tue, 9 Oct 2018 at 15:59, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
On Tue, 9 Oct 2018 at 15:43, Pavel Stehule <pavel.stehule@gmail.com> wrote:
Hi
I tested last patch and I have some notes:
1.
postgres=# explain select distinct a10000 from foo;
+-------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                 |
+-------------------------------------------------------------------------------------------+
| Unique  (cost=0.43..4367.56 rows=9983 width=4)                                             |
|   ->  Index Skip Scan using foo_a10000_idx on foo  (cost=0.43..4342.60 rows=9983 width=4)  |
+-------------------------------------------------------------------------------------------+
(2 rows)
In this case Unique node is useless and can be removed
Just to clarify which exactly version were you testing? If
index-skip-fallback.patch,
then the Unique node was added there to address the situation when
ndistinct is underestimated, with an idea to fallback to original plan
(and to tolerate that I suggested to use Unique, since we don't know
if fallback will happen or not during the planning).
I tested index-skip-fallback.patch.
It looks like good idea, but then the node should be named "index scan"
and other info can be displayed in detail parts. It can be similar like
"sort".
The combination of unique and index skip scan looks strange :)
maybe we don't need a special index skip scan node - maybe the possibility to
return unique values from the index scan node can be good enough - something
like a "distinct index scan" - and the implementation there can be dynamic -
skip scan, classic index scan.
"index skip scan" is not a good name if the implementation is dynamic.
2. Can be nice COUNT(DISTINCT support) similarly like MIN, MAX suppport
Yep, as far as I understand MIN/MAX is going to be the next step after this
patch is accepted.
ok
Now, the development cycle is starting - maybe it can use same
infrastructure like MIN/MAX and this part can be short.
more if you use dynamic index scan
3. Once time patched postgres crashed, but I am not able to reproduce it.
Maybe you have at least some ideas what could cause that or what's the
possible way to reproduce that doesn't work anymore?
I think it was query like
select count(*) from (select distinct x from tab) s
On Tue, 9 Oct 2018 at 18:13, Pavel Stehule <pavel.stehule@gmail.com> wrote:
It looks like good idea, but then the node should be named "index scan" and
other info can be displayed in detail parts. It can be similar like "sort".
The combination of unique and index skip scan looks strange :)
maybe we don't need special index skip scan node - maybe possibility to
return unique values from index scan node can be good enough - some like
"distinct index scan" - and the implementation there can be dynamic - skip
scan, classic index scan.
"index skip scan" is not good name if the implementation is dynamic.
Yeah, that's a valid point. The good part is that index skip scan is not really
a separate node, but just an enhanced index-only scan node. So indeed maybe it
would be better to call it Index Only Scan, but show in the details that we
apply the skip scan strategy. Any other opinions about this?
I think it was query like
select count(*) from (select distinct x from tab) s
Thanks, I'll take a look.
On Thu, Sep 13, 2018 at 11:40 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
I noticed that current implementation doesn't
perform well when we have lots of small groups of equal values. Here is
the execution time of index skip scan vs unique over index scan, in ms,
depending on the size of group. The benchmark script is attached.
group size skip unique
1 2,293.85 132.55
5 464.40 106.59
10 239.61 102.02
50 56.59 98.74
100 32.56 103.04
500 6.08 97.09
Yes, this doesn't look good. Using your test case I'm seeing that unique
is being chosen when the group size is below 34, and skip above.
I'm not sure exactly how the current patch is approaching the problem,
but it seems like it might be reasonable to do something like -- look
for a distinct item within the current page; if not found, then search
from the root of the tree for the next item > the current item.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, 12 Oct 2018 at 19:44, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Sep 13, 2018 at 11:40 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
I noticed that current implementation doesn't
perform well when we have lots of small groups of equal values. Here is
the execution time of index skip scan vs unique over index scan, in ms,
depending on the size of group. The benchmark script is attached.
group size skip unique
1 2,293.85 132.55
5 464.40 106.59
10 239.61 102.02
50 56.59 98.74
100 32.56 103.04
500 6.08 97.09
Yes, this doesn't look good. Using your test case I'm seeing that unique
is being chosen when the group size is below 34, and skip above.
I'm not sure exactly how the current patch is approaching the problem,
but it seems like it might be reasonable to do something like -- look
for a distinct item within the current page; if not found, then search
from the root of the tree for the next item > the current item.
I'm not sure that I understand it correctly, can you elaborate please? From
what I see it's quite similar to what's already implemented - we look for a
distinct item on the page, and then search the index tree for the next
distinct item.
On Wed, 10 Oct 2018 at 17:34, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
On Tue, 9 Oct 2018 at 18:13, Pavel Stehule <pavel.stehule@gmail.com> wrote:
It looks like good idea, but then the node should be named "index scan" and
other info can be displayed in detail parts. It can be similar like "sort".
The combination of unique and index skip scan looks strange :)
maybe we don't need special index skip scan node - maybe possibility to
return unique values from index scan node can be good enough - some like
"distinct index scan" - and the implementation there can be dynamic - skip
scan, classic index scan.
"index skip scan" is not good name if the implementation is dynamic.
Yeah, that's a valid point. The good part is that index skip scan is not really
a separate node, but just enhanced index only scan node. So indeed maybe it
would be better to call it Index Only Scan, but show in details that we apply
the skip scan strategy. Any other opinions about this?
To make it more clean what I mean, see attached version of the patch.
I think it was query like
select count(*) from (select distinct x from tab) s
Thanks, I'll take a look.
I couldn't reproduce it either yet.
Attachments:
index-skip-fallback-v2.patch (application/octet-stream)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6b2b9e3742..74ed15bfeb 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e1073ac6d3..8c79fc33ba 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3725,6 +3725,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index beb99d1831..ccbb44288d 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -665,6 +666,14 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ TODO
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index df7d16ff68..a5b1835e72 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1319,6 +1319,22 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
such cases and allow index-only scans to be generated, but older versions
will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ TODO
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e95fbbcea7..85d6571c6d 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -106,6 +106,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 0a32182dd7..162639090d 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..ecd4af49d8 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0002df30c0..7120950868 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index eade540ef5..7d04388b18 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -792,6 +793,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_amroutine->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_amroutine->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e8725fbbe1..3d02a96dad 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -130,6 +130,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -378,6 +379,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -445,6 +448,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d3700bd082..52c24c7541 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1193,6 +1193,121 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /* Release the current associated buffer */
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(scan->indexRelation, scan->xs_itup);
+ }
+ else
+ {
+ Relation rel;
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ rel = scan->indexRelation;
+ itupdesc = RelationGetDescr(rel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) | (rel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ so->skipScanKey[i].sk_flags = flags;
+ so->skipScanKey[i].sk_argument = datum;
+ }
+ }
+
+ /* Use _bt_search and _bt_binsrch to get the buffer and offset number */
+ stack = _bt_search(scan->indexRelation, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir), &buf, BT_READ,
+ scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ if (ScanDirectionIsForward(dir))
+ {
+ so->currPos.moreLeft = false;
+ so->currPos.moreRight = true;
+
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+ else
+ {
+ so->currPos.moreLeft = true;
+ so->currPos.moreRight = false;
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ _bt_freeskey(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 9919e6f0d7..0b77998886 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -68,6 +68,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 16a80a0ea1..88ef72bc62 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1222,6 +1222,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->distinctPrefix > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->distinctPrefix,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1444,6 +1452,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->distinctPrefix > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 8c32a74d39..b2acc22ef9 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -112,6 +112,21 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted
+ && node->ioss_NumOfSkips < 10 * node->ioss_PlanRows)
+ {
+ node->ioss_NumOfSkips += 1;
+ if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
+ {
+ /* Reached end of index. */
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -247,6 +262,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -509,6 +526,10 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_NumDistinctKeys = node->distinctPrefix;
+ indexstate->ioss_FirstTupleEmitted = false;
+ indexstate->ioss_NumOfSkips = 0;
+ indexstate->ioss_PlanRows = node->scan.plan.plan_rows;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 7c8220cf65..012f61d1ad 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -517,6 +517,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(distinctPrefix);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 93f1e2c4eb..d28ef70db0 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -586,6 +586,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(distinctPrefix);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 519deab63a..ddf565d92d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1800,6 +1800,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(distinctPrefix);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7bf67a0529..b4c4edd276 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -122,6 +122,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ae41c9efa0..9569f45745 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -175,7 +175,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2706,7 +2707,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -5115,7 +5117,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipprefix)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5130,6 +5133,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->distinctPrefix = skipprefix;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 22c010c19e..e28f98e4ce 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4731,6 +4731,25 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath) &&
+ path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ ((IndexPath *) path)->indexinfo->amcanskip)
+ {
+ Path *subpath = (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ list_length(root->distinct_pathkeys),
+ numDistinctRows);
+ add_path(distinct_rel, (Path *)
+ create_upper_unique_path(root, distinct_rel,
+ subpath,
+ list_length(root->distinct_pathkeys),
+ numDistinctRows));
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index c5aaaf5c22..b9e7baa5d4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2768,6 +2768,45 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(subpath, IndexPath));
+
+ /* We don't want to modify subpath, so make a copy. */
+ memcpy(pathnode, subpath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(numCols > 0);
+ pathnode->indexskipprefix = numCols;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = subpath->startup_cost;
+ pathnode->path.total_cost = subpath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 8369e3ad62..f07aba1c9c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -270,6 +270,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 77662aff7f..2bae6e06b1 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -852,6 +852,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4e61bc6521..87ff031a85 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -297,6 +297,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 14526a6bb2..81e1ea5d5f 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -127,6 +127,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -221,6 +225,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 534fac7bf2..ab973f0b5f 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -170,6 +170,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 04ecb4cbc0..6009edb22d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -471,6 +471,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ ScanKey skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -571,6 +574,7 @@ extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -598,6 +602,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 687d7cd2f4..319d3ba342 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1323,6 +1323,8 @@ typedef struct IndexScanState
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
* VMBuffer buffer in use for visibility map testing, if any
+ * NumDistinctKeys number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ioss_PscanLen Size of parallel index-only scan descriptor
* ----------------
*/
@@ -1341,7 +1343,11 @@ typedef struct IndexOnlyScanState
Relation ioss_RelationDesc;
IndexScanDesc ioss_ScanDesc;
Buffer ioss_VMBuffer;
+ int ioss_NumDistinctKeys;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
+ int ioss_PlanRows;
+ int ioss_NumOfSkips;
} IndexOnlyScanState;
/* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 7c2abbd03a..1e572853d8 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -438,6 +438,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int distinctPrefix; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index adb4265047..27adafd6a6 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -811,6 +811,7 @@ typedef struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1161,6 +1162,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1175,6 +1179,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 77ca7ff837..1fb3de6fa6 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -58,6 +58,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 7c5ff22650..349b062ee2 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -186,6 +186,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 0065e325c2..61053f0fb2 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..485454fcfa 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,24 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------------
+ Unique
+ Output: four
+ -> Index Only Scan using tenk1_four on public.tenk1
+ Output: four
+ Scan mode: Skip scan
+(5 rows)
+
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index be7f261871..3c73198179 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..db222fa510 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,9 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
Hello
The last published patch, index-skip-fallback-v2, applies and builds cleanly.
I found a reproducible crash due to an assert failure: FailedAssertion("!(numCols > 0)", File: "pathnode.c", Line: 2795)
create table tablename (i int primary key);
select distinct i from tablename where i = 1;
The query is obviously strange, but this is a bug.
I also noticed two TODOs in the documentation.
regards, Sergei
On Mon, 12 Nov 2018 at 13:29, Sergei Kornilov <sk@zsrv.org> wrote:
> I found a reproducible crash due to an assert failure: FailedAssertion("!(numCols > 0)", File: "pathnode.c", Line: 2795)
> create table tablename (i int primary key);
> select distinct i from tablename where i = 1;
> The query is obviously strange, but this is a bug.
Wow, thanks a lot! I can reproduce it too, will fix it.
On Mon, 12 Nov 2018 at 13:55, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On Mon, 12 Nov 2018 at 13:29, Sergei Kornilov <sk@zsrv.org> wrote:
> > I found a reproducible crash due to an assert failure: FailedAssertion("!(numCols > 0)", File: "pathnode.c", Line: 2795)
> > create table tablename (i int primary key);
> > select distinct i from tablename where i = 1;
> > The query is obviously strange, but this is a bug.
> Wow, thanks a lot! I can reproduce it too, will fix it.
Yep, we had to check the number of distinct columns too; here is the fixed patch
(with a slightly more verbose commit message).
Attachment: 0001-Index-skip-scan-with-fallback-v3.patch (application/octet-stream)
From eecc755713b2a3647a5ff2f0da85e27d6e35588c Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Sat, 15 Sep 2018 21:14:50 +0200
Subject: [PATCH] Index skip scan with fallback
Implementation of Index Skip Scan (see Loose Index Scan in the wiki
[1]). Since it relies on ndistinct, which can be unreliable, it's
implemented not as a separate node, but as a new strategy for Index Only
Scan with the possibility of fallback to a regular index only scan, if
there are too many distinct values than was estimated.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Jesper Pedersen, and a bit adjusted by Dmitry Dolgov.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 16 ++++
doc/src/sgml/indexam.sgml | 9 ++
doc/src/sgml/indices.sgml | 16 ++++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 ++++
src/backend/access/nbtree/nbtree.c | 12 +++
src/backend/access/nbtree/nbtsearch.c | 115 ++++++++++++++++++++++++++
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 12 +++
src/backend/executor/nodeIndexonlyscan.c | 21 +++++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/createplan.c | 10 ++-
src/backend/optimizer/plan/planner.c | 20 +++++
src/backend/optimizer/util/pathnode.c | 39 +++++++++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 5 ++
src/include/access/genam.h | 1 +
src/include/access/nbtree.h | 5 ++
src/include/nodes/execnodes.h | 6 ++
src/include/nodes/plannodes.h | 1 +
src/include/nodes/relation.h | 5 ++
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 ++
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 27 ++++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 7 ++
37 files changed, 372 insertions(+), 4 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6b2b9e3742..74ed15bfeb 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e1073ac6d3..8c79fc33ba 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3725,6 +3725,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index beb99d1831..ccbb44288d 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -665,6 +666,14 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ TODO
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index df7d16ff68..a5b1835e72 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1319,6 +1319,22 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
such cases and allow index-only scans to be generated, but older versions
will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ TODO
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e95fbbcea7..85d6571c6d 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -106,6 +106,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 0a32182dd7..162639090d 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..ecd4af49d8 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0002df30c0..7120950868 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index eade540ef5..7d04388b18 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -792,6 +793,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_amroutine->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_amroutine->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e8725fbbe1..3d02a96dad 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -130,6 +130,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -378,6 +379,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -445,6 +448,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d3700bd082..52c24c7541 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1193,6 +1193,121 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /* Release the current associated buffer */
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(scan->indexRelation, scan->xs_itup);
+ }
+ else
+ {
+ Relation rel;
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ rel = scan->indexRelation;
+ itupdesc = RelationGetDescr(rel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) | (rel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ so->skipScanKey[i].sk_flags = flags;
+ so->skipScanKey[i].sk_argument = datum;
+ }
+ }
+
+ /* Use _bt_search and _bt_binsrch to get the buffer and offset number */
+ stack = _bt_search(scan->indexRelation, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir), &buf, BT_READ,
+ scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ if (ScanDirectionIsForward(dir))
+ {
+ so->currPos.moreLeft = false;
+ so->currPos.moreRight = true;
+
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+ else
+ {
+ so->currPos.moreLeft = true;
+ so->currPos.moreRight = false;
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ _bt_freeskey(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 9919e6f0d7..0b77998886 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -68,6 +68,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 16a80a0ea1..88ef72bc62 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1222,6 +1222,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->distinctPrefix > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->distinctPrefix,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1444,6 +1452,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->distinctPrefix > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 8c32a74d39..b2acc22ef9 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -112,6 +112,21 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted
+ && node->ioss_NumOfSkips < 10 * node->ioss_PlanRows)
+ {
+ node->ioss_NumOfSkips += 1;
+ if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
+ {
+ /* Reached end of index. */
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -247,6 +262,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -509,6 +526,10 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_NumDistinctKeys = node->distinctPrefix;
+ indexstate->ioss_FirstTupleEmitted = false;
+ indexstate->ioss_NumOfSkips = 0;
+ indexstate->ioss_PlanRows = node->scan.plan.plan_rows;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 7c8220cf65..012f61d1ad 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -517,6 +517,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(distinctPrefix);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 93f1e2c4eb..d28ef70db0 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -586,6 +586,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(distinctPrefix);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 519deab63a..ddf565d92d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1800,6 +1800,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(distinctPrefix);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7bf67a0529..b4c4edd276 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -122,6 +122,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ae41c9efa0..9569f45745 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -175,7 +175,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2706,7 +2707,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -5115,7 +5117,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipprefix)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5130,6 +5133,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->distinctPrefix = skipprefix;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 22c010c19e..cb2232e94b 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4731,6 +4731,26 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath) &&
+ path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys > 0)
+ {
+ Path *subpath = (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ list_length(root->distinct_pathkeys),
+ numDistinctRows);
+ add_path(distinct_rel, (Path *)
+ create_upper_unique_path(root, distinct_rel,
+ subpath,
+ list_length(root->distinct_pathkeys),
+ numDistinctRows));
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index c5aaaf5c22..b9e7baa5d4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2768,6 +2768,45 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(subpath, IndexPath));
+
+ /* We don't want to modify subpath, so make a copy. */
+ memcpy(pathnode, subpath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(numCols > 0);
+ pathnode->indexskipprefix = numCols;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = subpath->startup_cost;
+ pathnode->path.total_cost = subpath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 8369e3ad62..f07aba1c9c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -270,6 +270,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 77662aff7f..2bae6e06b1 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -852,6 +852,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4e61bc6521..87ff031a85 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -297,6 +297,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 14526a6bb2..81e1ea5d5f 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -127,6 +127,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -221,6 +225,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 534fac7bf2..ab973f0b5f 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -170,6 +170,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 04ecb4cbc0..6009edb22d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -471,6 +471,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ ScanKey skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -571,6 +574,7 @@ extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -598,6 +602,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 687d7cd2f4..319d3ba342 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1323,6 +1323,8 @@ typedef struct IndexScanState
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
* VMBuffer buffer in use for visibility map testing, if any
+ * NumDistinctKeys number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ioss_PscanLen Size of parallel index-only scan descriptor
* ----------------
*/
@@ -1341,7 +1343,11 @@ typedef struct IndexOnlyScanState
Relation ioss_RelationDesc;
IndexScanDesc ioss_ScanDesc;
Buffer ioss_VMBuffer;
+ int ioss_NumDistinctKeys;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
+ int ioss_PlanRows;
+ int ioss_NumOfSkips;
} IndexOnlyScanState;
/* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 7c2abbd03a..1e572853d8 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -438,6 +438,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int distinctPrefix; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index adb4265047..27adafd6a6 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -811,6 +811,7 @@ typedef struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1161,6 +1162,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1175,6 +1179,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 77ca7ff837..1fb3de6fa6 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -58,6 +58,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 7c5ff22650..349b062ee2 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -186,6 +186,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 0065e325c2..61053f0fb2 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..247ba75142 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,30 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------------
+ Unique
+ Output: four
+ -> Index Only Scan using tenk1_four on public.tenk1
+ Output: four
+ Scan mode: Skip scan
+(5 rows)
+
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index be7f261871..3c73198179 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..992e8d7c4d 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,10 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
--
2.16.4
On 9/27/18 16:59, Jesper Pedersen wrote:
Hi Dmitry,
On 9/15/18 3:52 PM, Dmitry Dolgov wrote:
On Thu, 13 Sep 2018 at 21:36, Alexander Kuzmenkov
<a.kuzmenkov@postgrespro.ru> wrote:
On 13/09/18 at 18:39, Jesper Pedersen wrote:
I think we can improve this,
and the skip scan can be strictly faster than index scan regardless of
the data.
<...>
This is something to look at -- maybe there is a way to use
btpo_next/btpo_prev instead/too in order to speed things up. Atm we
just
have the scan key in BTScanOpaqueData. I'll take a look after my
upcoming vacation; feel free to contribute those changes in the
meantime
of course.
But having this logic inside _bt_next means that it will make a non-skip
index only scan a bit slower, am I right?
Correct.
Well, it depends on how the skip scan is implemented. We don't have to
make normal scans slower, because skip scan is just a separate thing.
My main point was that current implementation is good as a proof of
concept, but it is inefficient for some data and needs some unreliable
planner logic to work around this inefficiency. And now we also have
execution-time fallback because planning-time fallback isn't good
enough. This looks overly complicated. Let's try to design an algorithm
that works well regardless of the particular data and doesn't need these
heuristics. It should be possible to do so for btree.
Of course, I understand the reluctance to implement an entire new type
of btree traversal. Anastasia Lubennikova suggested a tweak for the
current method that may improve the performance for small groups of
equal values. When we search for the next unique key, first check if it
is contained on the current btree page using its 'high key'. If it is,
find it on the current page. If not, search from the root of the tree
like we do now. This should improve the performance for small equal
groups, because there are going to be several such groups on the page.
And this is exactly where we have the regression compared to unique +
index scan.
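The tweak described above can be sketched outside the btree code. Below is only a toy model under stated assumptions (fixed-size pages, integer keys, hypothetical names like skip_to_next), not the patch's actual implementation: the "index" is a sorted array split into "pages", the high-key check becomes a comparison against the last key of the current page, and the fallback descent from the root becomes a binary search over the whole array.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4

/* "Search from the root": first position whose key is greater than cur. */
static size_t
search_from_root(const int *keys, size_t nkeys, int cur)
{
	size_t		lo = 0,
				hi = nkeys;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (keys[mid] <= cur)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Advance *pos to the next distinct key; returns 0 at end of index. */
static int
skip_to_next(const int *keys, size_t nkeys, size_t *pos)
{
	int			cur = keys[*pos];
	size_t		page_end = (*pos / PAGE_SIZE + 1) * PAGE_SIZE;

	if (page_end > nkeys)
		page_end = nkeys;

	/* "High key" check: is there a larger key on the current page? */
	if (keys[page_end - 1] > cur)
	{
		for (size_t i = *pos + 1; i < page_end; i++)
		{
			if (keys[i] > cur)
			{
				*pos = i;
				return 1;
			}
		}
	}

	/* Not on this page: search again from the root. */
	*pos = search_from_root(keys, nkeys, cur);
	return *pos < nkeys;
}
```

With small equal groups several distinct values fit on one page, so the in-page scan usually succeeds and the descent from the root is avoided, which is exactly where the regression against Unique + index scan shows up.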
By the way, what is the data for which we intend this feature to work?
Obviously a non-unique btree index, but how wide are the tuples, and how
big the equal groups? It would be good to have some real-world examples.
--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Is skip scan only possible for index-only scan?
I haven't seen much discussion of this question yet. Is there a
particular reason to lock ourselves into thinking about this only in
an index only scan?
I think we can improve this,
and the skip scan can be strictly faster than index scan regardless of
the data. As a first approximation, imagine that we somehow skipped
equal tuples inside _bt_next instead of sending them to the parent
Unique node. This would already be marginally faster than Unique + Index
scan. A more practical implementation would be to remember our position
in tree (that is, BTStack returned by _bt_search) and use it to skip
pages in bulk. This looks straightforward to implement for a tree that
does not change, but I'm not sure how to make it work with concurrent
modifications. Still, this looks a worthwhile direction to me, because
if we have a strictly faster skip scan, we can just use it always and
not worry about our unreliable statistics. What do you think?
This is something to look at -- maybe there is a way to use
btpo_next/btpo_prev instead/too in order to speed things up. Atm we just
have the scan key in BTScanOpaqueData. I'll take a look after my
upcoming vacation; feel free to contribute those changes in the meantime
of course.
It seems to me also that the logic necessary for this kind of
traversal has other useful applications. For example, it should be
possible to build on that logic to allow and index like t(owner_fk,
created_at) to be used to execute the following query:
select *
from t
where owner_fk in (1,2,3)
order by created_at
limit 25
without needing to fetch all tuples satisfying "owner_fk in (1,2,3)"
and subsequently sorting them.
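That IN-list idea can be sketched with toy data structures. This is a hypothetical illustration of the execution strategy, not anything the patch implements: one ordered "cursor" per IN-list element over an index modelled as an array sorted by (owner_fk, created_at), merged by created_at until the limit is reached.

```c
#include <assert.h>

typedef struct
{
	int			owner_fk;
	int			created_at;
} IndexTuple;

#define MAX_KEYS 8

/* Fill out[] with at most limit tuples in created_at order; return count. */
static int
merge_limit(const IndexTuple *idx, int n,
			const int *keys, int nkeys, /* nkeys <= MAX_KEYS */
			int limit, IndexTuple *out)
{
	int			cursor[MAX_KEYS];
	int			nout = 0;

	/* Position one cursor at the start of each key's run in the index. */
	for (int k = 0; k < nkeys; k++)
	{
		cursor[k] = n;			/* exhausted by default */
		for (int i = 0; i < n; i++)
		{
			if (idx[i].owner_fk == keys[k])
			{
				cursor[k] = i;
				break;
			}
		}
	}

	while (nout < limit)
	{
		int			best = -1;

		/* Pick the stream whose next tuple has the smallest created_at. */
		for (int k = 0; k < nkeys; k++)
		{
			if (cursor[k] < n && idx[cursor[k]].owner_fk == keys[k] &&
				(best < 0 ||
				 idx[cursor[k]].created_at < idx[cursor[best]].created_at))
				best = k;
		}
		if (best < 0)
			break;				/* all streams exhausted */
		out[nout++] = idx[cursor[best]];
		cursor[best]++;
	}
	return nout;
}
```

Only the heads of the per-key runs are ever compared, so tuples outside the IN-list are never fetched and no sort is needed, regardless of how many rows match overall.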
Hi,
On 11/15/18 6:41 AM, Alexander Kuzmenkov wrote:
But having this logic inside _bt_next means that it will make a
non-skip index
only scan a bit slower, am I right?
Correct.
Well, it depends on how the skip scan is implemented. We don't have to
make normal scans slower, because skip scan is just a separate thing.
My main point was that current implementation is good as a proof of
concept, but it is inefficient for some data and needs some unreliable
planner logic to work around this inefficiency. And now we also have
execution-time fallback because planning-time fallback isn't good
enough. This looks overly complicated. Let's try to design an algorithm
that works well regardless of the particular data and doesn't need these
heuristics. It should be possible to do so for btree.
Of course, I understand the reluctance to implement an entire new type
of btree traversal. Anastasia Lubennikova suggested a tweak for the
current method that may improve the performance for small groups of
equal values. When we search for the next unique key, first check if it
is contained on the current btree page using its 'high key'. If it is,
find it on the current page. If not, search from the root of the tree
like we do now. This should improve the performance for small equal
groups, because there are going to be several such groups on the page.
And this is exactly where we have the regression compared to unique +
index scan.
Robert suggested something similar in [1]. I'll try and look at that
when I'm back from my holiday.
By the way, what is the data for which we intend this feature to work?
Obviously a non-unique btree index, but how wide are the tuples, and how
big the equal groups? It would be good to have some real-world examples.
Although my primary use-case is int I agree that we should test the
different data types, and tuple widths.
[1]: /messages/by-id/CA+Tgmobb3uN0xDqTRu7f7WdjGRAXpSFxeAQnvNr=OK5_kC_SSg@mail.gmail.com
Best regards,
Jesper
On Fri, 16 Nov 2018 at 16:06, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
On 11/15/18 6:41 AM, Alexander Kuzmenkov wrote:
But having this logic inside _bt_next means that it will make a
non-skip index
only scan a bit slower, am I right?
Correct.
Well, it depends on how the skip scan is implemented. We don't have to
make normal scans slower, because skip scan is just a separate thing.
My main point was that current implementation is good as a proof of
concept, but it is inefficient for some data and needs some unreliable
planner logic to work around this inefficiency. And now we also have
execution-time fallback because planning-time fallback isn't good
enough. This looks overly complicated. Let's try to design an algorithm
that works well regardless of the particular data and doesn't need these
heuristics. It should be possible to do so for btree.
Of course, I understand the reluctance to implement an entire new type
of btree traversal. Anastasia Lubennikova suggested a tweak for the
current method that may improve the performance for small groups of
equal values. When we search for the next unique key, first check if it
is contained on the current btree page using its 'high key'. If it is,
find it on the current page. If not, search from the root of the tree
like we do now. This should improve the performance for small equal
groups, because there are going to be several such groups on the page.
And this is exactly where we have the regression compared to unique +
index scan.
Robert suggested something similar in [1]. I'll try and look at that
when I'm back from my holiday.
Yeah, probably you're right. Unfortunately, I had misunderstood Robert's
previous message in this thread, which suggested a similar approach. Jesper,
I hope you don't mind if I post an updated patch? _bt_skip is changed there in the
suggested way, so that it checks the current page before searching from the
root of a tree, and I've removed the fallback logic. After some
initial tests I see
that with this version skip scan over a table with 10^7 rows and 10^6
distinct values is slightly slower than a regular scan, but not that much.
By the way, what is the data for which we intend this feature to work?
Obviously a non-unique btree index, but how wide are the tuples, and how
big the equal groups? It would be good to have some real-world examples.
Although my primary use-case is int, I agree that we should test the
different data types, and tuple widths.
My personal motivation here is exactly that we face use-cases for skip scan
from time to time. Usually there are only a few distinct values (up to a dozen
or so, which means the equal groups are quite big), but with a variety of types
and widths.
On Thu, 15 Nov 2018 at 15:28, James Coleman <jtc331@gmail.com> wrote:
Is skip scan only possible for index-only scan?
I haven't seen much discussion of this question yet. Is there a
particular reason to lock ourselves into thinking about this only in
an index only scan?
I guess, the only reason is to limit the scope of the patch.
Attachments:
0001-Index-skip-scan-v4.patch (application/octet-stream)
From 361ee64c7ba69ce4cbd57fc771bd5a25046bee4a Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Sat, 15 Sep 2018 21:14:50 +0200
Subject: [PATCH] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan. To make it suitable both for situations with a
small number of distinct values and for those with a significant number of
distinct values, the following approach is taken: instead of searching from
the root for every value, we search first on the current page, and then, if
not found, continue searching from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Jesper Pedersen, and a bit adjusted by Dmitry Dolgov.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 16 +++
doc/src/sgml/indexam.sgml | 9 ++
doc/src/sgml/indices.sgml | 16 +++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 +++
src/backend/access/nbtree/nbtree.c | 12 ++
src/backend/access/nbtree/nbtsearch.c | 153 ++++++++++++++++++++++++++
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 12 ++
src/backend/executor/nodeIndexonlyscan.c | 17 +++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/createplan.c | 10 +-
src/backend/optimizer/plan/planner.c | 16 +++
src/backend/optimizer/util/pathnode.c | 39 +++++++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 5 +
src/include/access/genam.h | 1 +
src/include/access/nbtree.h | 5 +
src/include/nodes/execnodes.h | 4 +
src/include/nodes/plannodes.h | 1 +
src/include/nodes/relation.h | 5 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 25 +++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 7 ++
37 files changed, 398 insertions(+), 4 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6b2b9e3742..74ed15bfeb 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e1073ac6d3..8c79fc33ba 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3725,6 +3725,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index beb99d1831..ccbb44288d 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -665,6 +666,14 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ TODO
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index df7d16ff68..a5b1835e72 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1319,6 +1319,22 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
such cases and allow index-only scans to be generated, but older versions
will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ TODO
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e95fbbcea7..85d6571c6d 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -106,6 +106,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 0a32182dd7..162639090d 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..ecd4af49d8 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0002df30c0..7120950868 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index eade540ef5..7d04388b18 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -792,6 +793,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_amroutine->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_amroutine->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e8725fbbe1..3d02a96dad 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -130,6 +130,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -378,6 +379,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -445,6 +448,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d3700bd082..95c17f142e 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1193,6 +1193,159 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Page page;
+ OffsetNumber high;
+ Relation indexRel = scan->indexRelation;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ }
+ else
+ {
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ so->skipScanKey[i].sk_flags = flags;
+ so->skipScanKey[i].sk_argument = datum;
+ }
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ buf = so->currPos.buf;
+
+ page = BufferGetPage(buf);
+ high = PageGetMaxOffsetNumber(page);
+
+ if (_bt_compare(scan->indexRelation, prefix,
+ so->skipScanKey, page, high) < 0)
+ {
+ bool keyFound = false;
+
+ LockBuffer(buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found the scan key within the current page, so let's scan
+ * from the root. Use _bt_search and _bt_binsrch to get the buffer and
+ * offset number.
+ */
+ stack = _bt_search(scan->indexRelation, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir), &buf, BT_READ,
+ scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ if (ScanDirectionIsForward(dir))
+ {
+ so->currPos.moreLeft = false;
+ so->currPos.moreRight = true;
+
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+ else
+ {
+ so->currPos.moreLeft = true;
+ so->currPos.moreRight = false;
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ _bt_freeskey(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 9919e6f0d7..0b77998886 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -68,6 +68,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 16a80a0ea1..88ef72bc62 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1222,6 +1222,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->distinctPrefix > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->distinctPrefix,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1444,6 +1452,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->distinctPrefix > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 8c32a74d39..e69157741f 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -112,6 +112,19 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
+ {
+ /* Reached end of index. */
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -247,6 +260,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -509,6 +524,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_NumDistinctKeys = node->distinctPrefix;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 7c8220cf65..012f61d1ad 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -517,6 +517,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(distinctPrefix);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 93f1e2c4eb..d28ef70db0 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -586,6 +586,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(distinctPrefix);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 519deab63a..ddf565d92d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1800,6 +1800,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(distinctPrefix);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7bf67a0529..b4c4edd276 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -122,6 +122,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ae41c9efa0..9569f45745 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -175,7 +175,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2706,7 +2707,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -5115,7 +5117,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipprefix)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5130,6 +5133,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->distinctPrefix = skipprefix;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 22c010c19e..2b3e46eb02 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4731,6 +4731,22 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath) &&
+ path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ list_length(root->distinct_pathkeys) > 0)
+ {
+ Path *subpath = (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ list_length(root->distinct_pathkeys),
+ numDistinctRows);
+ add_path(distinct_rel, subpath);
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index c5aaaf5c22..b9e7baa5d4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2768,6 +2768,45 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(subpath, IndexPath));
+
+ /* We don't want to modify subpath, so make a copy. */
+ memcpy(pathnode, subpath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(numCols > 0);
+ pathnode->indexskipprefix = numCols;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = subpath->startup_cost;
+ pathnode->path.total_cost = subpath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 8369e3ad62..f07aba1c9c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -270,6 +270,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 77662aff7f..2bae6e06b1 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -852,6 +852,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4e61bc6521..87ff031a85 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -297,6 +297,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 14526a6bb2..81e1ea5d5f 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -127,6 +127,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -221,6 +225,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 534fac7bf2..ab973f0b5f 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -170,6 +170,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 04ecb4cbc0..6009edb22d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -471,6 +471,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ ScanKey skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -571,6 +574,7 @@ extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -598,6 +602,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 687d7cd2f4..93f02f89be 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1323,6 +1323,8 @@ typedef struct IndexScanState
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
* VMBuffer buffer in use for visibility map testing, if any
+ * NumDistinctKeys number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ioss_PscanLen Size of parallel index-only scan descriptor
* ----------------
*/
@@ -1341,6 +1343,8 @@ typedef struct IndexOnlyScanState
Relation ioss_RelationDesc;
IndexScanDesc ioss_ScanDesc;
Buffer ioss_VMBuffer;
+ int ioss_NumDistinctKeys;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 7c2abbd03a..1e572853d8 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -438,6 +438,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int distinctPrefix; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index adb4265047..27adafd6a6 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -811,6 +811,7 @@ typedef struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1161,6 +1162,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1175,6 +1179,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 77ca7ff837..1fb3de6fa6 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -58,6 +58,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 7c5ff22650..349b062ee2 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -186,6 +186,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 0065e325c2..61053f0fb2 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..38c9bc4b9b 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,28 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------
+ Index Only Scan using tenk1_four on public.tenk1
+ Output: four
+ Scan mode: Skip scan
+(3 rows)
+
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index be7f261871..3c73198179 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..992e8d7c4d 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,10 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
--
2.16.4
On 11/18/18 02:27, Dmitry Dolgov wrote:
[0001-Index-skip-scan-v4.patch]
I ran a couple of tests on this, please see the cases below. As before,
I'm setting total_cost = 1 for index skip scan so that it is chosen.
Case 1 breaks because we determine the high key incorrectly, it is the
second tuple on page or something like that, not the last tuple. Case 2
is backwards scan, I don't understand how it is supposed to work. We
call _bt_search(nextKey = ScanDirectionIsForward), so it seems that it
just fetches the previous tuple like the regular scan does.
case 1:
# create table t as select generate_series(1, 1000000) a;
# create index ta on t(a);
# explain select count(*) from (select distinct a from t) d;
QUERY PLAN
-------------------------------------------------------------------------
Aggregate (cost=3.50..3.51 rows=1 width=8)
-> Index Only Scan using ta on t (cost=0.42..1.00 rows=200 width=4)
Scan mode: Skip scan
(3 rows)
postgres=# select count(*) from (select distinct a from t) d;
count
--------
500000 -- should be 1000000
(1 row)
case 2:
# create table t as select generate_series(1, 1000000) / 2 a;
# create index ta on t(a);
# explain select count(*) from (select distinct a from t order by a desc) d;
QUERY PLAN
-------------------------------------------------------------------------------------
Aggregate (cost=5980.81..5980.82 rows=1 width=8)
-> Index Only Scan Backward using ta on t (cost=0.42..1.00
rows=478385 width=4)
Scan mode: Skip scan
(3 rows)
# select count(*) from (select distinct a from t order by a desc) d;
count
--------
502733 -- should be 500k
(1 row)
--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
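As an aside, the page-local fast path that case 1 exercises can be sketched as a toy model (Python, standing in for the patch's C code; a plain sorted list stands in for a btree leaf page): the skip may only stay on the current page when the sought key compares at or below the page's *last* tuple, and otherwise must descend from the root again. Using anything but the last tuple as the bound, as in the bug above, silently drops keys that are present on the page.

```python
import bisect

def skip_on_page(page, target):
    """Toy model of _bt_skip's fast path on one sorted leaf page.
    Returns the offset of the first entry >= target if it lies on this
    page, or None when the scan must restart from the root."""
    # The bound must be the *last* tuple on the page; comparing against
    # any earlier tuple would wrongly send some keys to the root and
    # skip over entries that are actually present here.
    if not page or target > page[-1]:
        return None
    return bisect.bisect_left(page, target)

page = [10, 10, 20, 20, 30]
print(skip_on_page(page, 20))  # 2: next distinct group starts on this page
print(skip_on_page(page, 40))  # None: must descend from the root
```

This is only a model of the boundary check; the real code also has to deal with locking, pins, and concurrent page splits.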
On Wed, Nov 21, 2018 at 4:38 PM Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> wrote:
On 11/18/18 02:27, Dmitry Dolgov wrote:
[0001-Index-skip-scan-v4.patch]
I ran a couple of tests on this, please see the cases below. As before,
I'm setting total_cost = 1 for index skip scan so that it is chosen.
Case 1 breaks because we determine the high key incorrectly, it is the
second tuple on page or something like that, not the last tuple.
From what I can see, it wasn't about the high key, just a regular off-by-one error.
But anyway, thanks for noticing - for some reason it wasn't always
reproducible for me, so I missed this issue. Please find the fixed patch attached.
Also I think it invalidates my previous performance tests, so I would
appreciate if you can check it out too.
Case 2
is backwards scan, I don't understand how it is supposed to work. We
call _bt_search(nextKey = ScanDirectionIsForward), so it seems that it
just fetches the previous tuple like the regular scan does.
Well, no, it's called with ScanDirectionIsForward(dir). But as far as I
remember from the previous discussions, the entire topic of backward scans is
questionable for this patch, so I'll try to invest some time in it.
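The intended DISTINCT semantics, forward and backward, can be modeled with a small sketch (Python, not the patch's C code; bisect stands in for the btree descent): after emitting a tuple, the scan jumps past the whole group of equal keys instead of stepping through each duplicate, and a backward scan should yield the same distinct values in reverse order, which is what case 2 gets wrong.

```python
import bisect

def distinct_via_skip(sorted_keys, backward=False):
    """Toy model of a skip-based DISTINCT scan over an already-sorted
    list of keys (standing in for the btree leaf sequence)."""
    out = []
    if not backward:
        i = 0
        while i < len(sorted_keys):
            out.append(sorted_keys[i])
            # "amskip" forward: jump past every tuple equal to the current key
            i = bisect.bisect_right(sorted_keys, sorted_keys[i])
    else:
        i = len(sorted_keys) - 1
        while i >= 0:
            out.append(sorted_keys[i])
            # "amskip" backward: jump before the first tuple equal to the key
            i = bisect.bisect_left(sorted_keys, sorted_keys[i]) - 1
    return out

# Mirrors case 2 on a small scale: a = i / 2 with integer division
data = sorted(i // 2 for i in range(1, 21))
print(distinct_via_skip(data))                 # [0, 1, ..., 10]
print(distinct_via_skip(data, backward=True))  # [10, 9, ..., 0]
```

Either direction must visit each distinct value exactly once; any double-counting like the 502733 above means the skip landed inside a group it had already emitted.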
Attachments:
0001-Index-skip-scan-v5.patch (application/octet-stream)
From d70864e031c007f08ea45c2c1c751ee4a9e3d3e3 Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Sat, 15 Sep 2018 21:14:50 +0200
Subject: [PATCH] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan. To make it suitable both for situations with a
small number of distinct values and for those with a significant number of
them, the following approach is taken: instead of searching from the root
for every value, we first search on the current page, and only if the value
is not found there do we continue searching from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Jesper Pedersen, and a bit adjusted by Dmitry Dolgov.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 16 +++
doc/src/sgml/indexam.sgml | 9 ++
doc/src/sgml/indices.sgml | 16 +++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 +++
src/backend/access/nbtree/nbtree.c | 12 ++
src/backend/access/nbtree/nbtsearch.c | 164 ++++++++++++++++++++++++++
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 12 ++
src/backend/executor/nodeIndexonlyscan.c | 17 +++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/createplan.c | 10 +-
src/backend/optimizer/plan/planner.c | 16 +++
src/backend/optimizer/util/pathnode.c | 39 ++++++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 5 +
src/include/access/genam.h | 1 +
src/include/access/nbtree.h | 5 +
src/include/nodes/execnodes.h | 4 +
src/include/nodes/plannodes.h | 1 +
src/include/nodes/relation.h | 5 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 25 ++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 7 ++
37 files changed, 409 insertions(+), 4 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6b2b9e3742..74ed15bfeb 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e1073ac6d3..8c79fc33ba 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3725,6 +3725,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index beb99d1831..ccbb44288d 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -665,6 +666,14 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ TODO
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index df7d16ff68..a5b1835e72 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1319,6 +1319,22 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
such cases and allow index-only scans to be generated, but older versions
will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ TODO
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e95fbbcea7..85d6571c6d 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -106,6 +106,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 0a32182dd7..162639090d 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..ecd4af49d8 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0002df30c0..7120950868 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index eade540ef5..7d04388b18 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -792,6 +793,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_amroutine->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_amroutine->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e8725fbbe1..3d02a96dad 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -130,6 +130,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -378,6 +379,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -445,6 +448,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d3700bd082..15479707ee 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1193,6 +1193,170 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low, high, compare_offset;
+ Relation indexRel = scan->indexRelation;
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ }
+ else
+ {
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ so->skipScanKey[i].sk_flags = flags;
+ so->skipScanKey[i].sk_argument = datum;
+ }
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ buf = so->currPos.buf;
+
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ if (_bt_compare(scan->indexRelation, prefix,
+ so->skipScanKey, page, compare_offset) > compare_value)
+ {
+ bool keyFound = false;
+
+ LockBuffer(buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found the scan key within the current page, so let's scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+ * number.
+ */
+ stack = _bt_search(scan->indexRelation, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir), &buf, BT_READ,
+ scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ _bt_freeskey(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 9919e6f0d7..0b77998886 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -68,6 +68,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 16a80a0ea1..88ef72bc62 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1222,6 +1222,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->distinctPrefix > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->distinctPrefix,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1444,6 +1452,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->distinctPrefix > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 8c32a74d39..e69157741f 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -112,6 +112,19 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
+ {
+ /* Reached end of index. */
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -247,6 +260,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -509,6 +524,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_NumDistinctKeys = node->distinctPrefix;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 7c8220cf65..012f61d1ad 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -517,6 +517,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(distinctPrefix);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 93f1e2c4eb..d28ef70db0 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -586,6 +586,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(distinctPrefix);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 519deab63a..ddf565d92d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1800,6 +1800,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(distinctPrefix);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7bf67a0529..b4c4edd276 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -122,6 +122,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ae41c9efa0..9569f45745 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -175,7 +175,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2706,7 +2707,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -5115,7 +5117,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipprefix)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5130,6 +5133,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->distinctPrefix = skipprefix;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 22c010c19e..2b3e46eb02 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4731,6 +4731,22 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath) &&
+ path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys != NIL)
+ {
+ Path *subpath = (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ list_length(root->distinct_pathkeys),
+ numDistinctRows);
+ add_path(distinct_rel, subpath);
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index c5aaaf5c22..b9e7baa5d4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2768,6 +2768,45 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(subpath, IndexPath));
+
+ /* We don't want to modify subpath, so make a copy. */
+ memcpy(pathnode, subpath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(numCols > 0);
+ pathnode->indexskipprefix = numCols;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = subpath->startup_cost;
+ pathnode->path.total_cost = subpath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 8369e3ad62..f07aba1c9c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -270,6 +270,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 77662aff7f..2bae6e06b1 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -852,6 +852,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4e61bc6521..87ff031a85 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -297,6 +297,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 14526a6bb2..81e1ea5d5f 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -127,6 +127,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -221,6 +225,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 534fac7bf2..ab973f0b5f 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -170,6 +170,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 04ecb4cbc0..6009edb22d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -471,6 +471,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ ScanKey skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -571,6 +574,7 @@ extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -598,6 +602,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 687d7cd2f4..93f02f89be 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1323,6 +1323,8 @@ typedef struct IndexScanState
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
* VMBuffer buffer in use for visibility map testing, if any
+ * NumDistinctKeys number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ioss_PscanLen Size of parallel index-only scan descriptor
* ----------------
*/
@@ -1341,6 +1343,8 @@ typedef struct IndexOnlyScanState
Relation ioss_RelationDesc;
IndexScanDesc ioss_ScanDesc;
Buffer ioss_VMBuffer;
+ int ioss_NumDistinctKeys;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 7c2abbd03a..1e572853d8 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -438,6 +438,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int distinctPrefix; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index adb4265047..27adafd6a6 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -811,6 +811,7 @@ typedef struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1161,6 +1162,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1175,6 +1179,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 77ca7ff837..1fb3de6fa6 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -58,6 +58,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 7c5ff22650..349b062ee2 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -186,6 +186,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 0065e325c2..61053f0fb2 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..38c9bc4b9b 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,28 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------
+ Index Only Scan using tenk1_four on public.tenk1
+ Output: four
+ Scan mode: Skip scan
+(3 rows)
+
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index be7f261871..3c73198179 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..992e8d7c4d 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,10 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
--
2.16.4
On Wed, Nov 21, 2018 at 12:55 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Well, no, it's called with ScanDirectionIsForward(dir). But as far as I
remember from the previous discussions the entire topic of backward scan is
questionable for this patch, so I'll try to invest some time in it.
Another thing that I think is related to skip scans that you should be
aware of is dynamic prefix truncation, which I started a thread on
just now [1]. While I see one big problem with the POC patch I came up
with, I think that that optimization is going to be something that
ends up happening at some point. Repeatedly descending a B-Tree when
the leading column is very low cardinality can be made quite a lot
less expensive by dynamic prefix truncation. Actually, it's almost a
perfect case for it.
I'm not asking anybody to do anything with that information. "Big
picture" thinking seems particularly valuable when working on the
B-Tree code; I don't want anybody to miss a possible future
opportunity.
[1]: /messages/by-id/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
--
Peter Geoghegan
On Wed, Nov 21, 2018 at 9:56 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
On Wed, Nov 21, 2018 at 4:38 PM Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> wrote:
On 11/18/18 02:27, Dmitry Dolgov wrote:
[0001-Index-skip-scan-v4.patch]
I ran a couple of tests on this, please see the cases below. As before,
I'm setting total_cost = 1 for index skip scan so that it is chosen.
Case 1 breaks because we determine the high key incorrectly, it is the
second tuple on page or something like that, not the last tuple.
From what I see it wasn't about the high key, just a regular off-by-one error.
But anyway, thanks for noticing - for some reason it wasn't always
reproducible for me, so I missed this issue. Please find the fixed patch attached.
Also I think it invalidates my previous performance tests, so I would
appreciate if you can check it out too.
I've performed some testing, and on my environment with a dataset of 10^7
records:
* everything below 7.5 * 10^5 unique records out of 10^7 was faster with skip
scan.
* above 7.5 * 10^5 unique records skip scan was slower, e.g. for 10^6 unique
records it was about 20% slower than the regular index scan.
For me these numbers sound good, since even in quite extreme case of
approximately 10 records per group the performance of index skip scan is close
to the same for the regular index only scan.
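The setup behind these numbers isn't spelled out in the mail; a plausible reconstruction (table and column names are my assumption, modeled on the example from the opening message) for the 10^6-distinct-values case would be:

```sql
-- Hypothetical reproduction of the benchmark described above:
-- 10^7 rows, 10^6 distinct values of b (~10 rows per group).
CREATE TABLE t2 (a integer PRIMARY KEY, b integer);
CREATE INDEX idx_t2_b ON t2 (b);
INSERT INTO t2 SELECT i, i % 1000000 FROM generate_series(1, 10000000) AS i;
ANALYZE t2;
-- With the patch applied, compare timings with the GUC on and off:
SET enable_indexskipscan = on;
EXPLAIN (ANALYZE, BUFFERS) SELECT DISTINCT b FROM t2;
SET enable_indexskipscan = off;
EXPLAIN (ANALYZE, BUFFERS) SELECT DISTINCT b FROM t2;
```

Varying the modulus in the INSERT sweeps the distinct-value count between the two regimes described above.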
On Tue, Dec 4, 2018 at 4:26 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Nov 21, 2018 at 12:55 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Well, no, it's called with ScanDirectionIsForward(dir). But as far as I
remember from the previous discussions the entire topic of backward scan is
questionable for this patch, so I'll try to invest some time in it.
Another thing that I think is related to skip scans that you should be
aware of is dynamic prefix truncation, which I started a thread on
just now [1]. While I see one big problem with the POC patch I came up
with, I think that that optimization is going to be something that
ends up happening at some point. Repeatedly descending a B-Tree when
the leading column is very low cardinality can be made quite a lot
less expensive by dynamic prefix truncation. Actually, it's almost a
perfect case for it.
Thanks, sounds cool. I'll try it out as soon as I have some spare time.
On Thu, Dec 20, 2018 at 2:46 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
I've performed some testing, and on my environment with a dataset of 10^7
records:
* everything below 7.5 * 10^5 unique records out of 10^7 was faster with skip
scan.
* above 7.5 * 10^5 unique records skip scan was slower, e.g. for 10^6 unique
records it was about 20% slower than the regular index scan.
For me these numbers sound good, since even in quite extreme case of
approximately 10 records per group the performance of index skip scan is close
to the same for the regular index only scan.
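A useful baseline for such comparisons is the recursive-CTE workaround from the wiki page cited at the start of the thread, which emulates a loose index scan on unpatched servers (written here against the t1 table from the opening message):

```sql
-- Wiki-style loose index scan emulation: each recursion step jumps to
-- the next distinct value of b via the idx_t1_b index, instead of
-- reading every row.
WITH RECURSIVE distinct_b AS (
    (SELECT b FROM t1 ORDER BY b LIMIT 1)
    UNION ALL
    SELECT (SELECT b FROM t1 WHERE b > d.b ORDER BY b LIMIT 1)
    FROM distinct_b d
    WHERE d.b IS NOT NULL
)
SELECT b FROM distinct_b WHERE b IS NOT NULL;
```

The correlated subquery returns NULL once the last distinct value has been emitted, which terminates the recursion; the outer WHERE filters out that sentinel row.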
Rebased version after rd_amroutine was renamed.
Attachments:
0001-Index-skip-scan-v6.patch (application/octet-stream)
From 9fce35b14ba11eb3a6a6e7c114e26921cbfd8983 Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Sat, 15 Sep 2018 21:14:50 +0200
Subject: [PATCH] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan. To make it suitable both when there is a small
number of distinct values and when there is a significant number of them,
the following approach is taken: instead of descending from the root for
every value we're searching for, we first search on the current page, and
only if the value is not found there do we continue the search from the
root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Jesper Pedersen, and a bit adjusted by Dmitry Dolgov.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 16 +++
doc/src/sgml/indexam.sgml | 9 ++
doc/src/sgml/indices.sgml | 193 ++++++++++++++++++++++++++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 +++
src/backend/access/nbtree/nbtree.c | 12 ++
src/backend/access/nbtree/nbtsearch.c | 164 ++++++++++++++++++++++
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 12 ++
src/backend/executor/nodeIndexonlyscan.c | 17 +++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/createplan.c | 10 +-
src/backend/optimizer/plan/planner.c | 16 +++
src/backend/optimizer/util/pathnode.c | 39 ++++++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 5 +
src/include/access/genam.h | 1 +
src/include/access/nbtree.h | 5 +
src/include/nodes/execnodes.h | 4 +
src/include/nodes/plannodes.h | 1 +
src/include/nodes/relation.h | 5 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 25 ++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 7 +
37 files changed, 586 insertions(+), 4 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6458376578..f637635438 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b6f5822b84..395b7de7e8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4284,6 +4284,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 05102724ea..1550fcfb86 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -666,6 +667,14 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ TODO
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 46f427b312..60f306571b 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1391,6 +1391,199 @@ CREATE INDEX test1c_content_y_index ON test1c (content COLLATE "y");
</sect1>
+ <sect1 id="indexes-index-only-scans">
+ <title>Index-Only Scans</title>
+
+ <indexterm zone="indexes-index-only-scans">
+ <primary>index</primary>
+ <secondary>index-only scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-only-scans">
+ <primary>index-only scan</primary>
+ </indexterm>
+
+ <para>
+ All indexes in <productname>PostgreSQL</productname> are <firstterm>secondary</firstterm>
+ indexes, meaning that each index is stored separately from the table's
+ main data area (which is called the table's <firstterm>heap</firstterm>
+ in <productname>PostgreSQL</productname> terminology). This means that in an
+ ordinary index scan, each row retrieval requires fetching data from both
+ the index and the heap. Furthermore, while the index entries that match a
+ given indexable <literal>WHERE</literal> condition are usually close together in
+ the index, the table rows they reference might be anywhere in the heap.
+ The heap-access portion of an index scan thus involves a lot of random
+ access into the heap, which can be slow, particularly on traditional
+ rotating media. (As described in <xref linkend="indexes-bitmap-scans"/>,
+ bitmap scans try to alleviate this cost by doing the heap accesses in
+ sorted order, but that only goes so far.)
+ </para>
+
+ <para>
+ To solve this performance problem, <productname>PostgreSQL</productname>
+ supports <firstterm>index-only scans</firstterm>, which can answer queries from an
+ index alone without any heap access. The basic idea is to return values
+ directly out of each index entry instead of consulting the associated heap
+ entry. There are two fundamental restrictions on when this method can be
+ used:
+
+ <orderedlist>
+ <listitem>
+ <para>
+ The index type must support index-only scans. B-tree indexes always
+ do. GiST and SP-GiST indexes support index-only scans for some
+ operator classes but not others. Other index types have no support.
+ The underlying requirement is that the index must physically store, or
+ else be able to reconstruct, the original data value for each index
+ entry. As a counterexample, GIN indexes cannot support index-only
+ scans because each index entry typically holds only part of the
+ original data value.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The query must reference only columns stored in the index. For
+ example, given an index on columns <literal>x</literal> and <literal>y</literal> of a
+ table that also has a column <literal>z</literal>, these queries could use
+ index-only scans:
+<programlisting>
+SELECT x, y FROM tab WHERE x = 'key';
+SELECT x FROM tab WHERE x = 'key' AND y < 42;
+</programlisting>
+ but these queries could not:
+<programlisting>
+SELECT x, z FROM tab WHERE x = 'key';
+SELECT x FROM tab WHERE x = 'key' AND z < 42;
+</programlisting>
+ (Expression indexes and partial indexes complicate this rule,
+ as discussed below.)
+ </para>
+ </listitem>
+ </orderedlist>
+ </para>
+
+ <para>
+ If these two fundamental requirements are met, then all the data values
+ required by the query are available from the index, so an index-only scan
+ is physically possible. But there is an additional requirement for any
+ table scan in <productname>PostgreSQL</productname>: it must verify that each
+ retrieved row be <quote>visible</quote> to the query's MVCC snapshot, as
+ discussed in <xref linkend="mvcc"/>. Visibility information is not stored
+ in index entries, only in heap entries; so at first glance it would seem
+ that every row retrieval would require a heap access anyway. And this is
+ indeed the case, if the table row has been modified recently. However,
+ for seldom-changing data there is a way around this
+ problem. <productname>PostgreSQL</productname> tracks, for each page in a table's
+ heap, whether all rows stored in that page are old enough to be visible to
+ all current and future transactions. This information is stored in a bit
+ in the table's <firstterm>visibility map</firstterm>. An index-only scan, after
+ finding a candidate index entry, checks the visibility map bit for the
+ corresponding heap page. If it's set, the row is known visible and so the
+ data can be returned with no further work. If it's not set, the heap
+ entry must be visited to find out whether it's visible, so no performance
+ advantage is gained over a standard index scan. Even in the successful
+ case, this approach trades visibility map accesses for heap accesses; but
+ since the visibility map is four orders of magnitude smaller than the heap
+ it describes, far less physical I/O is needed to access it. In most
+ situations the visibility map remains cached in memory all the time.
+ </para>
+
+ <para>
+ In short, while an index-only scan is possible given the two fundamental
+ requirements, it will be a win only if a significant fraction of the
+ table's heap pages have their all-visible map bits set. But tables in
+ which a large fraction of the rows are unchanging are common enough to
+ make this type of scan very useful in practice.
+ </para>
+
+ <para>
+ To make effective use of the index-only scan feature, you might choose to
+ create indexes in which only the leading columns are meant to
+ match <literal>WHERE</literal> clauses, while the trailing columns
+ hold <quote>payload</quote> data to be returned by a query. For example, if
+ you commonly run queries like
+<programlisting>
+SELECT y FROM tab WHERE x = 'key';
+</programlisting>
+ the traditional approach to speeding up such queries would be to create an
+ index on <literal>x</literal> only. However, an index on <literal>(x, y)</literal>
+ would offer the possibility of implementing this query as an index-only
+ scan. As previously discussed, such an index would be larger and hence
+ more expensive than an index on <literal>x</literal> alone, so this is attractive
+ only if the table is known to be mostly static. Note it's important that
+ the index be declared on <literal>(x, y)</literal> not <literal>(y, x)</literal>, as for
+ most index types (particularly B-trees) searches that do not constrain the
+ leading index columns are not very efficient.
+ </para>
+
+ <para>
+ In principle, index-only scans can be used with expression indexes.
+ For example, given an index on <literal>f(x)</literal> where <literal>x</literal> is a
+ table column, it should be possible to execute
+<programlisting>
+SELECT f(x) FROM tab WHERE f(x) < 1;
+</programlisting>
+ as an index-only scan; and this is very attractive if <literal>f()</literal> is
+ an expensive-to-compute function. However, <productname>PostgreSQL</productname>'s
+ planner is currently not very smart about such cases. It considers a
+ query to be potentially executable by index-only scan only when
+ all <emphasis>columns</emphasis> needed by the query are available from the index.
+ In this example, <literal>x</literal> is not needed except in the
+ context <literal>f(x)</literal>, but the planner does not notice that and
+ concludes that an index-only scan is not possible. If an index-only scan
+ seems sufficiently worthwhile, this can be worked around by declaring the
+ index to be on <literal>(f(x), x)</literal>, where the second column is not
+ expected to be used in practice but is just there to convince the planner
+ that an index-only scan is possible. An additional caveat, if the goal is
+ to avoid recalculating <literal>f(x)</literal>, is that the planner won't
+ necessarily match uses of <literal>f(x)</literal> that aren't in
+ indexable <literal>WHERE</literal> clauses to the index column. It will usually
+ get this right in simple queries such as shown above, but not in queries
+ that involve joins. These deficiencies may be remedied in future versions
+ of <productname>PostgreSQL</productname>.
+ </para>
+
+ <para>
+ Partial indexes also have interesting interactions with index-only scans.
+ Consider the partial index shown in <xref linkend="indexes-partial-ex3"/>:
+<programlisting>
+CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target)
+ WHERE success;
+</programlisting>
+ In principle, we could do an index-only scan on this index to satisfy a
+ query like
+<programlisting>
+SELECT target FROM tests WHERE subject = 'some-subject' AND success;
+</programlisting>
+ But there's a problem: the <literal>WHERE</literal> clause refers
+ to <literal>success</literal> which is not available as a result column of the
+ index. Nonetheless, an index-only scan is possible because the plan does
+ not need to recheck that part of the <literal>WHERE</literal> clause at run time:
+ all entries found in the index necessarily have <literal>success = true</literal>
+ so this need not be explicitly checked in the
+ plan. <productname>PostgreSQL</productname> versions 9.6 and later will recognize
+ such cases and allow index-only scans to be generated, but older versions
+ will not.
+ </para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ TODO
+ </para>
+ </sect2>
+ </sect1>
+
+
<sect1 id="indexes-examine">
<title>Examining Index Usage</title>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 467d91e681..720696f84e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -108,6 +108,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index afc20232ac..36f32f15a4 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b75b3a8dac..11b0a899d3 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -84,6 +84,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f1f01a0956..07d7eeda56 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 4ad30186d9..4f3774128b 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -792,6 +793,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..134eda34ed 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -130,6 +130,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -378,6 +379,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -445,6 +448,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 92832237a8..a9012dc1d1 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1192,6 +1192,170 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low, high, compare_offset;
+ Relation indexRel = scan->indexRelation;
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ }
+ else
+ {
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ so->skipScanKey[i].sk_flags = flags;
+ so->skipScanKey[i].sk_argument = datum;
+ }
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ buf = so->currPos.buf;
+
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ if(_bt_compare(scan->indexRelation, prefix,
+ so->skipScanKey, page, compare_offset) > compare_value)
+ {
+ bool keyFound = false;
+
+ LockBuffer(buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found scan key within the current page, so let's scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+ * number
+ */
+ stack =_bt_search(scan->indexRelation, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir), &buf, BT_READ,
+ scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ _bt_freeskey(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index de147d7b68..a45edfa94b 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -68,6 +68,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ae7f038203..487ffcb407 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1298,6 +1298,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->distinctPrefix > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->distinctPrefix,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1520,6 +1528,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->distinctPrefix > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index b3f61dd1fc..2d048a9725 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -114,6 +114,19 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
+ {
+ /* Reached end of index. */
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -249,6 +262,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -505,6 +520,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_NumDistinctKeys = node->distinctPrefix;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 3eb7e95d64..5fcac97f2b 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -514,6 +514,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(distinctPrefix);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 0fde876c77..e24aa415f6 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -572,6 +572,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(distinctPrefix);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index ec6f2569ab..c2bf1bbd89 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1799,6 +1799,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(distinctPrefix);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 99c5ad9b4a..8d00f00c17 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -122,6 +122,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 97d0c28132..aac8d2e796 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -171,7 +171,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2722,7 +2723,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -4996,7 +4998,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipprefix)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5011,6 +5014,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->distinctPrefix = skipprefix;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 4465f002c8..a679bbbbde 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4710,6 +4710,22 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath) &&
+ path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys != NIL)
+ {
+ Path *subpath = (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ list_length(root->distinct_pathkeys),
+ numDistinctRows);
+ add_path(distinct_rel, subpath);
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b2637d0e89..fcb6d140b7 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2769,6 +2769,45 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(subpath, IndexPath));
+
+ /* We don't want to modify subpath, so make a copy. */
+ memcpy(pathnode, subpath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(numCols > 0);
+ pathnode->indexskipprefix = numCols;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = subpath->startup_cost;
+ pathnode->path.total_cost = subpath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 261492e6b7..b20faeaa50 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -269,6 +269,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c216ed0922..71f31bbfeb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -893,6 +893,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a21865a77f..834a775773 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -345,6 +345,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 653ddc976b..082a9bb0d6 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -127,6 +127,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -221,6 +225,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index c4aba39496..a9bf4f58a9 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -170,6 +170,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4fb92d60a1..e74149d1a4 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -470,6 +470,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ ScanKey skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -570,6 +573,7 @@ extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -597,6 +601,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7cae085177..6fbc023246 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1390,6 +1390,8 @@ typedef struct IndexScanState
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
* VMBuffer buffer in use for visibility map testing, if any
+ * NumDistinctKeys number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ioss_PscanLen Size of parallel index-only scan descriptor
* ----------------
*/
@@ -1408,6 +1410,8 @@ typedef struct IndexOnlyScanState
Relation ioss_RelationDesc;
struct IndexScanDescData *ioss_ScanDesc;
Buffer ioss_VMBuffer;
+ int ioss_NumDistinctKeys;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 6d087c268f..632b05a84f 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -431,6 +431,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int distinctPrefix; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 3430061361..fd7b5996d9 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -810,6 +810,7 @@ typedef struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1160,6 +1161,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1174,6 +1178,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index e7005b4a0c..acfa5416f2 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -58,6 +58,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index bd905d3328..38b99cd41b 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -186,6 +186,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 46deb55c67..c9acae96d7 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..38c9bc4b9b 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,28 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------
+ Index Only Scan using tenk1_four on public.tenk1
+ Output: four
+ Scan mode: Skip scan
+(3 rows)
+
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 59da6b6592..588616446e 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..992e8d7c4d 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,10 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
--
2.16.4
On Sat, Jan 26, 2019 at 6:45 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Rebased version after rd_amroutine was renamed.
And one more to fix the documentation. I've also noticed a few TODOs in
the patch about missing docs, and replaced them with the required
explanation of the feature.
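For reviewers who want a quick way to exercise the patch, a minimal sketch along the lines of the example at the top of the thread (table, data, and session are illustrative; the exact plan output depends on which version of the patch is applied):

```sql
-- Illustrative smoke test; assumes the skip scan patch is applied.
CREATE TABLE t1 (a integer PRIMARY KEY, b integer);
CREATE INDEX idx_t1_b ON t1 (b);
INSERT INTO t1 SELECT i, i % 3 FROM generate_series(1, 1000000) AS i;
ANALYZE t1;

-- With the new GUC on (the default), DISTINCT on the indexed column
-- should produce an Index Only Scan annotated "Scan mode: Skip scan".
SET enable_indexskipscan = on;
EXPLAIN (VERBOSE, COSTS OFF) SELECT DISTINCT b FROM t1;

-- With it off, the planner falls back to the existing plans
-- (e.g. a Seq Scan feeding a HashAggregate).
SET enable_indexskipscan = off;
EXPLAIN (VERBOSE, COSTS OFF) SELECT DISTINCT b FROM t1;
```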
Attachments:
0001-Index-skip-scan-v7.patch (application/octet-stream)
From a29e0825bbb17f28cfad9fd2619b0841b45d63b3 Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Sat, 15 Sep 2018 21:14:50 +0200
Subject: [PATCH] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan. To make it suitable both when there are few
distinct values and when there are many, the following approach is
taken: instead of descending from the root for every value, we first
search on the current page, and only if the next value is not found
there do we continue the search from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Jesper Pedersen, and a bit adjusted by Dmitry Dolgov.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 16 +++
doc/src/sgml/indexam.sgml | 10 ++
doc/src/sgml/indices.sgml | 24 ++++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 +++
src/backend/access/nbtree/nbtree.c | 12 ++
src/backend/access/nbtree/nbtsearch.c | 164 ++++++++++++++++++++++++++
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 12 ++
src/backend/executor/nodeIndexonlyscan.c | 17 +++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/createplan.c | 10 +-
src/backend/optimizer/plan/planner.c | 16 +++
src/backend/optimizer/util/pathnode.c | 39 ++++++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 5 +
src/include/access/genam.h | 1 +
src/include/access/nbtree.h | 5 +
src/include/nodes/execnodes.h | 4 +
src/include/nodes/plannodes.h | 1 +
src/include/nodes/relation.h | 5 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 25 ++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 7 ++
37 files changed, 418 insertions(+), 4 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6458376578..f637635438 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b6f5822b84..395b7de7e8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4284,6 +4284,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 05102724ea..1b12ad9493 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -666,6 +667,15 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan.
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 46f427b312..461e4b00db 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1221,6 +1221,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When an index scan is used to retrieve the distinct values of a column,
+ it can be inefficient, since it still has to step over all the equal
+ values of each key. In such cases the planner will consider an index
+ skip scan, which is based on the idea of a
+ <ulink url="https://wiki.postgresql.org/wiki/Loose_indexscan">
+ Loose index scan</ulink>. Rather than scanning all equal values of a key,
+ as soon as a new value is found, it searches for a larger value on the
+ same index page, and if none is found there, restarts the search from
+ the root. This is much faster when the index has many equal keys.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 467d91e681..720696f84e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -108,6 +108,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index afc20232ac..36f32f15a4 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b75b3a8dac..11b0a899d3 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -84,6 +84,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f1f01a0956..07d7eeda56 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 4ad30186d9..4f3774128b 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -792,6 +793,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..134eda34ed 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -130,6 +130,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -378,6 +379,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -445,6 +448,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 92832237a8..a9012dc1d1 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1192,6 +1192,170 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low, high, compare_offset;
+ Relation indexRel = scan->indexRelation;
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ }
+ else
+ {
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ so->skipScanKey[i].sk_flags = flags;
+ so->skipScanKey[i].sk_argument = datum;
+ }
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ buf = so->currPos.buf;
+
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ if (_bt_compare(scan->indexRelation, prefix,
+ so->skipScanKey, page, compare_offset) > compare_value)
+ {
+ bool keyFound = false;
+
+ LockBuffer(buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found scan key within the current page, so let's scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+ * number
+ */
+ stack = _bt_search(scan->indexRelation, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir), &buf, BT_READ,
+ scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ _bt_freeskey(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index de147d7b68..a45edfa94b 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -68,6 +68,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ae7f038203..487ffcb407 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1298,6 +1298,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->distinctPrefix > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->distinctPrefix,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1520,6 +1528,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->distinctPrefix > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index b3f61dd1fc..2d048a9725 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -114,6 +114,19 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
+ {
+ /* Reached end of index. */
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -249,6 +262,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -505,6 +520,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_NumDistinctKeys = node->distinctPrefix;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 3eb7e95d64..5fcac97f2b 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -514,6 +514,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(distinctPrefix);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 0fde876c77..e24aa415f6 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -572,6 +572,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(distinctPrefix);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index ec6f2569ab..c2bf1bbd89 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1799,6 +1799,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(distinctPrefix);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 99c5ad9b4a..8d00f00c17 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -122,6 +122,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 97d0c28132..aac8d2e796 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -171,7 +171,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2722,7 +2723,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -4996,7 +4998,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipprefix)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5011,6 +5014,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->distinctPrefix = skipprefix;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 4465f002c8..a679bbbbde 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4710,6 +4710,22 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath) &&
+ path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys != NIL)
+ {
+ Path *subpath = (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ list_length(root->distinct_pathkeys),
+ numDistinctRows);
+ add_path(distinct_rel, subpath);
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b2637d0e89..fcb6d140b7 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2769,6 +2769,45 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(subpath, IndexPath));
+
+ /* We don't want to modify subpath, so make a copy. */
+ memcpy(pathnode, subpath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(numCols > 0);
+ pathnode->indexskipprefix = numCols;
+
+ /*
+ * The cost of skipping to each distinct value should be roughly the
+ * same as the cost of finding the first key, so estimate the total
+ * cost as that amount times the number of distinct values we expect
+ * to find.
+ */
+ pathnode->path.startup_cost = subpath->startup_cost;
+ pathnode->path.total_cost = subpath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 261492e6b7..b20faeaa50 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -269,6 +269,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c216ed0922..71f31bbfeb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -893,6 +893,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a21865a77f..834a775773 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -345,6 +345,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 653ddc976b..082a9bb0d6 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -127,6 +127,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -221,6 +225,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index c4aba39496..a9bf4f58a9 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -170,6 +170,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4fb92d60a1..e74149d1a4 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -470,6 +470,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ ScanKey skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -570,6 +573,7 @@ extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -597,6 +601,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7cae085177..6fbc023246 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1390,6 +1390,8 @@ typedef struct IndexScanState
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
* VMBuffer buffer in use for visibility map testing, if any
+ * NumDistinctKeys number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ioss_PscanLen Size of parallel index-only scan descriptor
* ----------------
*/
@@ -1408,6 +1410,8 @@ typedef struct IndexOnlyScanState
Relation ioss_RelationDesc;
struct IndexScanDescData *ioss_ScanDesc;
Buffer ioss_VMBuffer;
+ int ioss_NumDistinctKeys;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 6d087c268f..632b05a84f 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -431,6 +431,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int distinctPrefix; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 3430061361..fd7b5996d9 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -810,6 +810,7 @@ typedef struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1160,6 +1161,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1174,6 +1178,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index e7005b4a0c..acfa5416f2 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -58,6 +58,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index bd905d3328..38b99cd41b 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -186,6 +186,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 46deb55c67..c9acae96d7 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..38c9bc4b9b 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,28 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------
+ Index Only Scan using tenk1_four on public.tenk1
+ Output: four
+ Scan mode: Skip scan
+(3 rows)
+
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 59da6b6592..588616446e 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..992e8d7c4d 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,10 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
--
2.16.4
On Sun, Jan 27, 2019 at 6:17 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
On Sat, Jan 26, 2019 at 6:45 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Rebased version after rd_amroutine was renamed.
And one more to fix the documentation. Also I've noticed few TODOs in the patch
about the missing docs, and replaced them with a required explanation of the
feature.
A bit of adjustment after nodes/relation -> nodes/pathnodes.
Attachments:
0001-Index-skip-scan-v8.patch (application/octet-stream)
From e6afcef3e94cb0aa1ace43ae6e574b7d7d1632ec Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Sat, 15 Sep 2018 21:14:50 +0200
Subject: [PATCH] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan. To make it suitable both when there are few
distinct values and when there are many, the following approach is
taken: instead of searching from the root for every value, we search
first on the current page, and only if the value is not found there do
we continue the search from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Jesper Pedersen, and a bit adjusted by Dmitry Dolgov.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 16 +++
doc/src/sgml/indexam.sgml | 10 ++
doc/src/sgml/indices.sgml | 24 ++++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 +++
src/backend/access/nbtree/nbtree.c | 12 ++
src/backend/access/nbtree/nbtsearch.c | 164 ++++++++++++++++++++++++++
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 12 ++
src/backend/executor/nodeIndexonlyscan.c | 17 +++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/createplan.c | 10 +-
src/backend/optimizer/plan/planner.c | 16 +++
src/backend/optimizer/util/pathnode.c | 39 ++++++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 5 +
src/include/access/genam.h | 1 +
src/include/access/nbtree.h | 5 +
src/include/nodes/execnodes.h | 4 +
src/include/nodes/pathnodes.h | 5 +
src/include/nodes/plannodes.h | 1 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 25 ++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 7 ++
37 files changed, 418 insertions(+), 4 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6458376578..f637635438 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b6f5822b84..395b7de7e8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4284,6 +4284,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 05102724ea..1b12ad9493 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -666,6 +667,15 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ Skip past all tuples where the first <parameter>prefix</parameter> columns
+ have the same value as the last tuple returned in the current scan.
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 46f427b312..461e4b00db 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1221,6 +1221,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When an index scan is used to retrieve the distinct values of a column,
+ it can be inefficient, because it has to scan all the equal values of a
+ key. In such cases the planner will consider applying the index skip
+ scan approach, which is based on the idea of a
+ <ulink url="https://wiki.postgresql.org/wiki/Loose_indexscan">
+ Loose index scan</ulink>. Rather than scanning all equal values of a
+ key, as soon as a new value is found, the scan searches for a larger
+ value on the same index page, and if none is found there, it restarts
+ the search from the root of the index. This is much faster when the
+ index has many equal keys.
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 467d91e681..720696f84e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -108,6 +108,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index afc20232ac..36f32f15a4 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b75b3a8dac..11b0a899d3 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -84,6 +84,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f1f01a0956..07d7eeda56 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 4ad30186d9..4f3774128b 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -792,6 +793,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..134eda34ed 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -130,6 +130,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -378,6 +379,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -445,6 +448,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 92832237a8..a9012dc1d1 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1192,6 +1192,170 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low, high, compare_offset;
+ Relation indexRel = scan->indexRelation;
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ }
+ else
+ {
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ so->skipScanKey[i].sk_flags = flags;
+ so->skipScanKey[i].sk_argument = datum;
+ }
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ buf = so->currPos.buf;
+
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ if (_bt_compare(scan->indexRelation, prefix,
+ so->skipScanKey, page, compare_offset) > compare_value)
+ {
+ bool keyFound = false;
+
+ LockBuffer(buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found scan key within the current page, so let's scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+ * number
+ */
stack = _bt_search(scan->indexRelation, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir), &buf, BT_READ,
+ scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ _bt_freeskey(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 8e63c1fad2..a55bd5e9f5 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 1831ea81cf..adf796a64f 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1297,6 +1297,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->distinctPrefix > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->distinctPrefix,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1519,6 +1527,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->distinctPrefix > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index b3f61dd1fc..2d048a9725 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -114,6 +114,19 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
+ {
+ /* Reached end of index. */
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -249,6 +262,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -505,6 +520,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_NumDistinctKeys = node->distinctPrefix;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 807393dfaa..129ece294a 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -514,6 +514,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(distinctPrefix);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 9d44e3e4c6..74246147ce 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -572,6 +572,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(distinctPrefix);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 43491e297b..b7e5ca7556 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1802,6 +1802,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(distinctPrefix);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b8d406f230..bb194b7e0e 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 1b4f7db649..e335fffe84 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -173,7 +173,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2752,7 +2753,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -5064,7 +5066,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipprefix)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5079,6 +5082,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->distinctPrefix = skipprefix;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b2239728cf..e6118ee35e 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4732,6 +4732,22 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath) &&
+ path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ list_length(root->distinct_pathkeys) > 0)
+ {
+ Path *subpath = (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ list_length(root->distinct_pathkeys),
+ numDistinctRows);
+ add_path(distinct_rel, subpath);
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b57de6b4c6..e2384ba896 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2799,6 +2799,45 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(subpath, IndexPath));
+
+ /* We don't want to modify subpath, so make a copy. */
+ memcpy(pathnode, subpath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(numCols > 0);
+ pathnode->indexskipprefix = numCols;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = subpath->startup_cost;
+ pathnode->path.total_cost = subpath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 3efa1bdc1a..e89f22c6bd 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -269,6 +269,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 98d75be292..376141e9d5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -894,6 +894,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a21865a77f..834a775773 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -345,6 +345,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 653ddc976b..082a9bb0d6 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -127,6 +127,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -221,6 +225,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index c4aba39496..a9bf4f58a9 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -170,6 +170,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4fb92d60a1..e74149d1a4 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -470,6 +470,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ ScanKey skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -570,6 +573,7 @@ extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -597,6 +601,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3b789ee7cf..24be304768 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1390,6 +1390,8 @@ typedef struct IndexScanState
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
* VMBuffer buffer in use for visibility map testing, if any
+ * NumDistinctKeys number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ioss_PscanLen Size of parallel index-only scan descriptor
* ----------------
*/
@@ -1408,6 +1410,8 @@ typedef struct IndexOnlyScanState
Relation ioss_RelationDesc;
struct IndexScanDescData *ioss_ScanDesc;
Buffer ioss_VMBuffer;
+ int ioss_NumDistinctKeys;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index d3c477a542..b9e7bb0036 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -815,6 +815,7 @@ typedef struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1165,6 +1166,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1179,6 +1183,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 6d087c268f..632b05a84f 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -431,6 +431,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int distinctPrefix; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..33c7a0a376 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index d0c8f99d0a..96b5262715 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -190,6 +190,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 46deb55c67..c9acae96d7 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..38c9bc4b9b 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,28 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------
+ Index Only Scan using tenk1_four on public.tenk1
+ Output: four
+ Scan mode: Skip scan
+(3 rows)
+
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 59da6b6592..588616446e 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..992e8d7c4d 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,10 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
--
2.16.4
Hello.
At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in <CA+q6zcVP18wYiO=aa+fz3GuncuTF52q1sufB7ise37TJPSDK1w@mail.gmail.com>
A bit of adjustment after nodes/relation -> nodes/pathnodes.
I had a look at this.
The name "index skip scan" refers to a different feature in other
products, namely an index scan on the postfix part of a (mainly
multi-column) key that ignores the prefix part. As Thomas suggested,
I'd propose we call this one "index hop scan". (I could accept
"Hopscotch", too :p)
Also, as Peter Geoghegan mentioned upthread, this could easily produce
a worse plan through underestimation. So I also suggest that this have
a dynamic fallback mechanism; from that perspective it is not well
suited to being an AM API level feature.
If all leaf pages are in the buffer and the average hopping distance
is less than (presumably) 4 pages (the average height of the tree),
the skip scan will lose. If almost all leaf pages stay on disk, we
could win with even a 2-page step (skipping over just one page).
=====
While writing the above, I came to think it would be better to
implement this as a pure executor optimization.
Specifically, let _bt_steppage count the ratio of skipped pages so
far; if the ratio exceeds some threshold (maybe around 3/4), the scan
enters hopscotching mode, where it uses an index search to find the
next page rather than traversing to it. As mentioned above, I think
using a skip to go beyond the next page is a good bet. If the success
ratio of skip scans falls below some threshold (I'm not sure what
yet), we should fall back to traversing.
Any opinions?
====
Some comments on the patch below.
+ skip scan approach, which is based on the idea of
+ <ulink url="https://wiki.postgresql.org/wiki/Free_Space_Map_Problems">
+ Loose index scan</ulink>. Rather than scanning all equal values of a key,
+ as soon as a new value is found, it will search for a larger value on the
I'm not sure it is a good idea to put a pointer to such an unstable
source in the documentation.
This adds a new AM method, but it seems available only for ordered
indexes, specifically btree. And it seems the feature could be
implemented in btgettuple, since btskip apparently does much the same
thing. (I agree with Robert on this point; see [1].)
[1]: /messages/by-id/CA+Tgmobb3uN0xDqTRu7f7WdjGRAXpSFxeAQnvNr=OK5_kC_SSg@mail.gmail.com
Related to the above, it seems better for the path generation of skip
scan to be part of the index scan's own path generation; whether to
skip or not is a matter for the index scan itself.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 1/31/19 1:31 AM, Kyotaro HORIGUCHI wrote:
Thanks for your valuable feedback! And agreed, calling it "Loose index
scan" or something else would be better.
Dmitry and I will look at this and take it into account for the next
version.
For now, I have switched the CF entry to WoA.
Thanks again !
Best regards,
Jesper
On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hi!
I'd like to offer a counterpoint: in cases where this is a huge win,
we definitely do want this to affect cost estimation, because if it's
purely an executor optimization, the index scan path may not be chosen
even when skip scanning would be a dramatic improvement.
I suppose both requirements could be met by incorporating it into the
existing index scanning code and also modifying the costing to account
for the optimization (perhaps only when we have high confidence?). I'm
not sure whether that makes things better than the current state of
the patch or not.
James Coleman
On 2019-02-01 16:04:58 -0500, James Coleman wrote:
I'd like to offer a counterpoint: in cases where this is a huge win,
we definitely do want this to affect cost estimation, because if it's
purely an executor optimization, the index scan path may not be chosen
even when skip scanning would be a dramatic improvement.
+many.
Greetings,
Andres Freund
On Fri, Feb 1, 2019 at 8:24 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
Dmitry and I will look at this and take it into account for the next
version.
In the meantime, so it isn't forgotten, I'm going to post another version with a
fix for cursor fetch backwards, which was crashing before. And while on
this topic, I wanted to ask for clarification on a few points, since it
looks like I'm missing something:
One of the not-yet-addressed points in this patch is amcanbackward. From the
historical thread mentioned in the first email:
On 2016-11-25 at 01:33 Robert Haas <robertmhaas@gmail.com> wrote:

+ if (ScanDirectionIsForward(dir))
+ {
+     so->currPos.moreLeft = false;
+     so->currPos.moreRight = true;
+ }
+ else
+ {
+     so->currPos.moreLeft = true;
+     so->currPos.moreRight = false;
+ }

The lack of comments makes it hard for me to understand what the
motivation for this is, but I bet it's wrong. Suppose there's a
cursor involved here and the user tries to back up. Instead of having
a separate amskip operation, maybe there should be a flag attached to
a scan indicating whether it should return only distinct results.
Otherwise, you're allowing for the possibility that the same scan
might sometimes skip and other times not skip, but then it seems hard
for the scan to survive cursor operations. Anyway, why would that be
useful?
I assume that "sometimes skip and other times not skip" refers to the
situation when we did a fetch forward and jumped something over, and then
right away do a fetch backwards, where we don't actually need to skip
anything and can get the result right away, right? If so, I can't find
any comments about why it should be a problem for cursor operations.
Attachments:
v9-0001-Index-skip-scan.patch (application/octet-stream)
From e562b12d866ed46c9eabfb6c283ec96ea9f6f7e0 Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Sat, 15 Sep 2018 21:14:50 +0200
Subject: [PATCH v9] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan. To make it suitable both for situations with a
small number of distinct values and for those with a significant number
of distinct values, the following approach is taken: instead of searching
from the root for every value we're looking for, we first search on the
current page, and if the value is not found there, continue searching
from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Jesper Pedersen, and a bit adjusted by Dmitry Dolgov.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 16 +++
doc/src/sgml/indexam.sgml | 10 ++
doc/src/sgml/indices.sgml | 24 ++++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 +++
src/backend/access/nbtree/nbtree.c | 12 ++
src/backend/access/nbtree/nbtsearch.c | 167 ++++++++++++++++++++++++++
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 12 ++
src/backend/executor/nodeIndexonlyscan.c | 17 +++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/createplan.c | 10 +-
src/backend/optimizer/plan/planner.c | 16 +++
src/backend/optimizer/util/pathnode.c | 39 ++++++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 5 +
src/include/access/genam.h | 1 +
src/include/access/nbtree.h | 5 +
src/include/nodes/execnodes.h | 4 +
src/include/nodes/pathnodes.h | 5 +
src/include/nodes/plannodes.h | 1 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 25 ++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 7 ++
37 files changed, 421 insertions(+), 4 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6458376578..f637635438 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8bd57f376b..2db65aa362 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4322,6 +4322,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 05102724ea..1b12ad9493 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -666,6 +667,15 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan.
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 46f427b312..461e4b00db 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1221,6 +1221,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When an index scan is used to retrieve the distinct values of a column,
+ it can be inefficient, since it has to scan all the equal values of a
+ key. In such cases the planner will consider applying the index skip
+ scan approach, which is based on the idea of a
+ <ulink url="https://wiki.postgresql.org/wiki/Loose_indexscan">
+ Loose index scan</ulink>. Rather than scanning all equal values of a key,
+ as soon as a new value is found, it searches for a larger value on the
+ same index page, and if none is found, restarts the search from the
+ root. This is much faster when the index has many equal keys.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 8f008dd008..639c8d7115 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -108,6 +108,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index afc20232ac..36f32f15a4 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b75b3a8dac..11b0a899d3 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -84,6 +84,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f1f01a0956..07d7eeda56 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 4ad30186d9..4f3774128b 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -792,6 +793,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..134eda34ed 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -130,6 +130,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -378,6 +379,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -445,6 +448,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 92832237a8..dc9d85a521 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1192,6 +1192,173 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low, high, compare_offset;
+ Relation indexRel = scan->indexRelation;
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ }
+ else
+ {
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ so->skipScanKey[i].sk_flags = flags;
+ so->skipScanKey[i].sk_argument = datum;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ /* Check if the next unique key can be found within the current page */
+ buf = so->currPos.buf;
+
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ if (_bt_compare(scan->indexRelation, prefix,
+ so->skipScanKey, page, compare_offset) > compare_value)
+ {
+ bool keyFound = false;
+
+ LockBuffer(buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+ }
+
+ /*
+ * We haven't found scan key within the current page, so let's scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+ * number
+ */
+ stack = _bt_search(scan->indexRelation, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir), &buf, BT_READ,
+ scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ _bt_freeskey(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 8e63c1fad2..a55bd5e9f5 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 1831ea81cf..adf796a64f 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1297,6 +1297,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->distinctPrefix > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->distinctPrefix,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1519,6 +1527,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->distinctPrefix > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index b3f61dd1fc..2d048a9725 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -114,6 +114,19 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
+ {
+ /* Reached end of index. */
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -249,6 +262,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -505,6 +520,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_NumDistinctKeys = node->distinctPrefix;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index e15724bb0e..4b11ad5b40 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -514,6 +514,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(distinctPrefix);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 65302fe65b..1d62a2051e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -572,6 +572,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(distinctPrefix);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 5aa42242a9..9e38c7947e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1803,6 +1803,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(distinctPrefix);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4b9be13f08..91a35e9beb 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 236f506cfb..7dc169700e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -178,7 +178,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2754,7 +2755,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -5030,7 +5032,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipprefix)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5045,6 +5048,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->distinctPrefix = skipprefix;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 1a58d733fa..5e76af7353 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4734,6 +4734,22 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath) &&
+ path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys > 0)
+ {
+ Path *subpath = (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ list_length(root->distinct_pathkeys),
+ numDistinctRows);
+ add_path(distinct_rel, subpath);
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 169e51e792..5d526e941c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2884,6 +2884,45 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(subpath, IndexPath));
+
+ /* We don't want to modify subpath, so make a copy. */
+ memcpy(pathnode, subpath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(numCols > 0);
+ pathnode->indexskipprefix = numCols;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = subpath->startup_cost;
+ pathnode->path.total_cost = subpath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index d6dc83ca80..9eb5aaca07 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -271,6 +271,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 156d147c85..0ea6a0296e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -908,6 +908,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 194f312096..3cc641c5e7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -347,6 +347,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 653ddc976b..082a9bb0d6 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -127,6 +127,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -221,6 +225,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index c4aba39496..a9bf4f58a9 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -170,6 +170,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4fb92d60a1..e74149d1a4 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -470,6 +470,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ ScanKey skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -570,6 +573,7 @@ extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -597,6 +601,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3b789ee7cf..24be304768 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1390,6 +1390,8 @@ typedef struct IndexScanState
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
* VMBuffer buffer in use for visibility map testing, if any
+ * NumDistinctKeys number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ioss_PscanLen Size of parallel index-only scan descriptor
* ----------------
*/
@@ -1408,6 +1410,8 @@ typedef struct IndexOnlyScanState
Relation ioss_RelationDesc;
struct IndexScanDescData *ioss_ScanDesc;
Buffer ioss_VMBuffer;
+ int ioss_NumDistinctKeys;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index a008ae07da..4a4b0eeb25 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -820,6 +820,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1156,6 +1157,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1168,6 +1172,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 6d087c268f..632b05a84f 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -431,6 +431,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int distinctPrefix; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..33c7a0a376 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 574bb85b50..2c480bc9c8 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -202,6 +202,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 5d4eb59a0c..f982ddbfb6 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..38c9bc4b9b 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,28 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------
+ Index Only Scan using tenk1_four on public.tenk1
+ Output: four
+ Scan mode: Skip scan
+(3 rows)
+
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 67ecad8dd5..f6e95eae57 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..992e8d7c4d 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,10 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
--
2.16.4
On Wed, Feb 20, 2019 at 11:33 AM Dmitry Dolgov <9erthalion6@gmail.com>
wrote:
On Fri, Feb 1, 2019 at 8:24 PM Jesper Pedersen <
jesper.pedersen@redhat.com> wrote:
Dmitry and I will look at this and take it into account for the next
version.
In the meantime, just to not forget, I'm going to post another version
with a fix for cursor fetch backwards, which was crashing before.
This version of the patch can return the wrong answer.
create index on pgbench_accounts (bid, aid);
begin; declare c cursor for select distinct on (bid) bid, aid from
pgbench_accounts order by bid, aid;
fetch 2 from c;
bid | aid
-----+---------
1 | 1
2 | 100,001
fetch backward 1 from c;
bid | aid
-----+---------
1 | 100,000
Without the patch, instead of getting a wrong answer, I get an error:
ERROR: cursor can only scan forward
HINT: Declare it with SCROLL option to enable backward scan.
If I add "SCROLL", then I do get the right answer with the patch.
Cheers,
Jeff
On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello.
At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthalion6@gmail.com>
wrote in <CA+q6zcVP18wYiO=aa+fz3GuncuTF52q1sufB7ise37TJPSDK1w@mail.gmail.com>
A bit of adjustment after nodes/relation -> nodes/pathnodes.
I had a look on this.
The name "index skip scan" is a different feature from the
feature with that name in other products, which means "index scan
with a postfix key (mainly of a multi-column key) that scans
ignoring the prefixing part". As Thomas suggested, I'd suggest that
we call it "index hop scan". (I can accept Hopscotch, either :p)
I think that what we have proposed here is just an incomplete
implementation of what other products call a skip scan, not a fundamentally
different thing. They don't ignore the prefix part, they use that part in
a way to cancel itself out to give the same answer, but faster. I think
they would also use this skip method to get distinct values if that is what
is requested. But they would go beyond that to also use it to do something
similar to the plan we get with this:
Set up:
pgbench -i -s50
create index on pgbench_accounts (bid, aid);
alter table pgbench_accounts drop constraint pgbench_accounts_pkey ;
Query:
explain analyze with t as (select distinct bid from pgbench_accounts )
select pgbench_accounts.* from pgbench_accounts join t using (bid) where
aid=5;
If we accept this patch, I hope it would be expanded in the future to give
similar performance as the above query does even when the query is written
in its more natural way of:
explain analyze select * from pgbench_accounts where aid=5;
(which currently takes 200ms, rather than the 0.9ms taken for the one
benefiting from skip scan)
I don't think we should give it a different name, just because our initial
implementation is incomplete.
Or do you think our implementation of one feature does not really get us
closer to implementing the other?
Cheers,
Jeff
On Fri, Mar 1, 2019 at 11:23 AM Jeff Janes <jeff.janes@gmail.com> wrote:
On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in <CA+q6zcVP18wYiO=aa+fz3GuncuTF52q1sufB7ise37TJPSDK1w@mail.gmail.com>
A bit of adjustment after nodes/relation -> nodes/pathnodes.
I had a look on this.
The name "index skip scan" is a different feature from the
feature with the name on other prodcuts, which means "index scan
with postfix key (of mainly of multi column key) that scans
ignoring the prefixing part" As Thomas suggested I'd suggest that
we call it "index hop scan". (I can accept Hopscotch, either :p)

I think that what we have proposed here is just an incomplete implementation of what other products call a skip scan, not a fundamentally different thing. They don't ignore the prefix part, they use that part in a way to cancel itself out to give the same answer, but faster. I think they would also use this skip method to get distinct values if that is what is requested. But they would go beyond that to also use it to do something similar to the plan we get with this:
Hi Jeff,
"Hop scan" was just a stupid joke that occurred to me when I saw that
DB2 had gone for "jump scan". I think "skip scan" is a perfectly good
name and it's pretty widely used by now (for example, by our friends
over at SQLite to blow us away at these kinds of queries).
Yes, simple distinct value scans are indeed just the easiest kind of
thing to do with this kind of scan-with-fast-forward. As discussed
already in this thread and the earlier one there is a whole family of
tricks you can do, and the thing that most people call an "index skip
scan" is indeed the try-every-prefix case where you can scan an index
on (a, b) given a WHERE clause b = x. Perhaps the simple distinct
scan could appear as "Distinct Index Skip Scan"? And perhaps the
try-every-prefix-scan could appear as just "Index Skip Scan"? Whether
these are the same executor node is a good question; at one point I
proposed a separate nest-loop like node for the try-every-prefix-scan,
but Robert shot that down pretty fast. I now suspect (as he said)
that all of this belongs in the index scan node, as different modes.
The behaviour is overlapping; for "Distinct Index Skip Scan" you skip
to each distinct prefix and emit one tuple, whereas for "Index Skip
Scan" you skip to each distinct prefix and then perform a regular scan
for the prefix + the suffix emitting matches.
(which currently takes 200ms, rather than the 0.9ms taken for the one benefiting from skip scan)
Nice.
I don't think we should give it a different name, just because our initial implementation is incomplete.
+1
Or do you think our implementation of one feature does not really get us closer to implementing the other?
My question when lobbing the earlier sketch patches into the mailing
list a few years back was: is this simple index AM interface and
implementation (once debugged) powerful enough for various kinds of
interesting skip-based plans? So far I have the impression that it
does indeed work for Distinct Index Skip Scan (demonstrated), Index
Skip Scan (no one has yet tried that), and special cases of extrema
aggregate queries (foo, MIN(bar) can be performed by skip scan of
index on (foo, bar)), but may not work for the semi-related merge join
trickery mentioned in a paper posted some time back (though I don't
recall exactly why). Another question is whether it should all be
done by the index scan node, and I think the answer is yes.
--
Thomas Munro
https://enterprisedb.com
On 3/1/19 12:03 AM, Thomas Munro wrote:
On Fri, Mar 1, 2019 at 11:23 AM Jeff Janes <jeff.janes@gmail.com> wrote:
On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in <CA+q6zcVP18wYiO=aa+fz3GuncuTF52q1sufB7ise37TJPSDK1w@mail.gmail.com>
A bit of adjustment after nodes/relation -> nodes/pathnodes.
I had a look on this.
The name "index skip scan" is a different feature from the
feature with the name on other prodcuts, which means "index scan
with postfix key (of mainly of multi column key) that scans
ignoring the prefixing part" As Thomas suggested I'd suggest that
we call it "index hop scan". (I can accept Hopscotch, either :p)

I think that what we have proposed here is just an incomplete implementation of what other products call a skip scan, not a fundamentally different thing. They don't ignore the prefix part, they use that part in a way to cancel itself out to give the same answer, but faster. I think they would also use this skip method to get distinct values if that is what is requested. But they would go beyond that to also use it to do something similar to the plan we get with this:
Hi Jeff,
"Hop scan" was just a stupid joke that occurred to me when I saw that
DB2 had gone for "jump scan". I think "skip scan" is a perfectly good
name and it's pretty widely used by now (for example, by our friends
over at SQLite to blow us away at these kinds of queries).
+1 to "hop scan"
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 28, 2019 at 10:45 PM Jeff Janes <jeff.janes@gmail.com> wrote:
This version of the patch can return the wrong answer.
Yes, indeed. In fact it answers my previous question related to the backward
cursor scan: while going back we didn't skip enough. Within the current
approach it can be fixed by proper skipping for the backward scan, something
like in the attached patch.
Although there are still some rough edges, e.g. going forth, back, and forth
again leads to a situation where `_bt_first` is not applied anymore and the
first element is wrongly skipped. I'll try to fix it in the next version of
the patch.
If we accept this patch, I hope it would be expanded in the future to give
similar performance as the above query does even when the query is written in
its more natural way of:
Yeah, I hope the current approach with a new index am routine can be extended
for that.
Without the patch, instead of getting a wrong answer, I get an error:
Right, as far as I can see without a skip scan and SCROLL, a unique + index
scan is used, where amcanbackward is false by default. So looks like it's not
really patch related.
Attachments:
v10-0001-Index-skip-scan.patch (application/octet-stream)
From 32aac0325358fd4f1bd0e632ed6af04ce3ffbe81 Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Sat, 15 Sep 2018 21:14:50 +0200
Subject: [PATCH v10] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan. To make it suitable both for situations with a
small number of distinct values and for a significant number of distinct
values, the following approach is taken: instead of searching from the
root for every value, we first search on the current page, and only if
not found there continue the search from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Jesper Pedersen, and a bit adjusted by Dmitry Dolgov.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 16 ++
doc/src/sgml/indexam.sgml | 10 ++
doc/src/sgml/indices.sgml | 24 +++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 ++
src/backend/access/nbtree/nbtree.c | 12 ++
src/backend/access/nbtree/nbtsearch.c | 249 ++++++++++++++++++++++++++
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 12 ++
src/backend/executor/nodeIndexonlyscan.c | 17 ++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/createplan.c | 10 +-
src/backend/optimizer/plan/planner.c | 16 ++
src/backend/optimizer/util/pathnode.c | 39 ++++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 5 +
src/include/access/genam.h | 1 +
src/include/access/nbtree.h | 5 +
src/include/nodes/execnodes.h | 4 +
src/include/nodes/pathnodes.h | 5 +
src/include/nodes/plannodes.h | 1 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 25 +++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 7 +
37 files changed, 503 insertions(+), 4 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6458376578..f637635438 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b6f5822b84..395b7de7e8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4284,6 +4284,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 05102724ea..1b12ad9493 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -666,6 +667,15 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan.
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 46f427b312..461e4b00db 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1221,6 +1221,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When an index scan is used to retrieve the distinct values of a column,
+ it can be inefficient, since it has to scan all the equal values of a
+ key. In such cases the planner will consider applying the index skip
+ scan approach, which is based on the idea of a
+ <ulink url="https://wiki.postgresql.org/wiki/Loose_indexscan">
+ Loose index scan</ulink>. Rather than scanning all equal values of a
+ key, as soon as a new value is found, it will search for a larger value
+ on the same index page, and if not found, restart the search from the
+ root. This is much faster when the index has many equal keys.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 467d91e681..720696f84e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -108,6 +108,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index afc20232ac..36f32f15a4 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b75b3a8dac..11b0a899d3 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -84,6 +84,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f1f01a0956..07d7eeda56 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 4ad30186d9..4f3774128b 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -792,6 +793,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..134eda34ed 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -130,6 +130,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -378,6 +379,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -445,6 +448,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 92832237a8..771414f0da 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1192,6 +1192,255 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low, high, compare_offset;
+ Relation indexRel = scan->indexRelation;
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ }
+ else
+ {
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ so->skipScanKey[i].sk_flags = flags;
+ so->skipScanKey[i].sk_argument = datum;
+ }
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos))
+ {
+ buf = so->currPos.buf;
+
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ if(_bt_compare(scan->indexRelation, prefix,
+ so->skipScanKey, page, compare_offset) > compare_value)
+ {
+ bool keyFound = false;
+
+ LockBuffer(buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found scan key within the current page, so let's scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+ * number
+ */
+ stack = _bt_search(scan->indexRelation, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir), &buf, BT_READ,
+ scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ if (ScanDirectionIsForward(dir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+ }
+ else
+ {
+ /* For backward scan finding offnum is more involved. It is wrong to
+ * just use binary search, since we will find the last item from the
+ * sequence of equal items, and we need the first one. Otherwise e.g.
+ * backward cursor scan will return an incorrect value. */
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ /* One step back to find a previous value */
+ if (_bt_next(scan, dir))
+ {
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ so->skipScanKey[i].sk_flags = flags;
+ so->skipScanKey[i].sk_argument = datum;
+ }
+
+ /* And now find the last item from the sequence for the current
+ value, with the intention to do OffsetNumberNext. As a result we
+ end up on the first element of the sequence. */
+ buf = so->currPos.buf;
+
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ if(_bt_compare(scan->indexRelation, prefix,
+ so->skipScanKey, page, compare_offset) > compare_value)
+ {
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+ }
+ else
+ {
+ if (BTScanPosIsValid(so->currPos))
+ {
+ /*_bt_drop_lock_and_maybe_pin(scan, &so->currPos);*/
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ stack = _bt_search(scan->indexRelation, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir), &buf, BT_READ,
+ scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+ }
+ }
+ else
+ {
+ _bt_freeskey(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+ else
+ {
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ _bt_freeskey(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 8e63c1fad2..a55bd5e9f5 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 1831ea81cf..adf796a64f 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1297,6 +1297,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->distinctPrefix > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->distinctPrefix,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1519,6 +1527,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->distinctPrefix > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index b3f61dd1fc..2d048a9725 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -114,6 +114,19 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
+ {
+ /* Reached end of index. */
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -249,6 +262,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -505,6 +520,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_NumDistinctKeys = node->distinctPrefix;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 807393dfaa..129ece294a 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -514,6 +514,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(distinctPrefix);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 9d44e3e4c6..74246147ce 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -572,6 +572,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(distinctPrefix);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 43491e297b..b7e5ca7556 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1802,6 +1802,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(distinctPrefix);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b8d406f230..bb194b7e0e 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 1b4f7db649..e335fffe84 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -173,7 +173,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2752,7 +2753,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -5064,7 +5066,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipprefix)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5079,6 +5082,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->distinctPrefix = skipprefix;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b2239728cf..e6118ee35e 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4732,6 +4732,22 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath) &&
+ path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys != NIL)
+ {
+ Path *subpath = (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ list_length(root->distinct_pathkeys),
+ numDistinctRows);
+ add_path(distinct_rel, subpath);
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b57de6b4c6..e2384ba896 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2799,6 +2799,45 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode that is the same as an existing IndexPath except
+ * that it skips duplicate values. This may or may not be cheaper than
+ * using create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(subpath, IndexPath));
+
+ /* We don't want to modify subpath, so make a copy. */
+ memcpy(pathnode, subpath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(numCols > 0);
+ pathnode->indexskipprefix = numCols;
+
+ /*
+ * The cost of skipping to each distinct value should be roughly the
+ * same as the cost of finding the first key, so estimate the total
+ * cost as that cost times the number of distinct values expected.
+ */
+ pathnode->path.startup_cost = subpath->startup_cost;
+ pathnode->path.total_cost = subpath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 3efa1bdc1a..e89f22c6bd 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -269,6 +269,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 98d75be292..376141e9d5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -894,6 +894,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a21865a77f..834a775773 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -345,6 +345,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 653ddc976b..082a9bb0d6 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -127,6 +127,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -221,6 +225,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index c4aba39496..a9bf4f58a9 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -170,6 +170,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4fb92d60a1..e74149d1a4 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -470,6 +470,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ ScanKey skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -570,6 +573,7 @@ extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -597,6 +601,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3b789ee7cf..24be304768 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1390,6 +1390,8 @@ typedef struct IndexScanState
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
* VMBuffer buffer in use for visibility map testing, if any
+ * NumDistinctKeys number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ioss_PscanLen Size of parallel index-only scan descriptor
* ----------------
*/
@@ -1408,6 +1410,8 @@ typedef struct IndexOnlyScanState
Relation ioss_RelationDesc;
struct IndexScanDescData *ioss_ScanDesc;
Buffer ioss_VMBuffer;
+ int ioss_NumDistinctKeys;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index d3c477a542..b9e7bb0036 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -815,6 +815,7 @@ typedef struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1165,6 +1166,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1179,6 +1183,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 6d087c268f..632b05a84f 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -431,6 +431,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int distinctPrefix; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..33c7a0a376 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index d0c8f99d0a..96b5262715 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -190,6 +190,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 46deb55c67..c9acae96d7 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..38c9bc4b9b 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,28 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------
+ Index Only Scan using tenk1_four on public.tenk1
+ Output: four
+ Scan mode: Skip scan
+(3 rows)
+
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 59da6b6592..588616446e 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..992e8d7c4d 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,10 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
--
2.16.4
On Tue, Mar 5, 2019 at 4:05 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Although there are still some rough edges, e.g. going forth, back and forth
again leads to a situation where `_bt_first` is not applied anymore and the
first element is wrongly skipped. I'll try to fix it in the next version of
the patch.
It turns out that `_bt_skip` was unnecessarily applied every time the scan
was restarted from the beginning. Here is the fixed version of the patch.
Attachments:
v11-0001-Index-skip-scan.patch (application/octet-stream)
From 29d561b88625e618c770b78da40fc9922df140a8 Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Sat, 15 Sep 2018 21:14:50 +0200
Subject: [PATCH v11] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan. To make it suitable both for situations with a
small number of distinct values and for those with a significant number
of distinct values, the following approach is taken: instead of searching
from the root for every value, we first search on the current page, and
only if the value is not found there do we continue the search from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Jesper Pedersen, and a bit adjusted by Dmitry Dolgov.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 16 ++
doc/src/sgml/indexam.sgml | 10 ++
doc/src/sgml/indices.sgml | 24 +++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 ++
src/backend/access/nbtree/nbtree.c | 12 ++
src/backend/access/nbtree/nbtsearch.c | 245 ++++++++++++++++++++++++++
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 12 ++
src/backend/executor/nodeIndexonlyscan.c | 22 +++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/createplan.c | 10 +-
src/backend/optimizer/plan/planner.c | 16 ++
src/backend/optimizer/util/pathnode.c | 39 ++++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 5 +
src/include/access/genam.h | 1 +
src/include/access/nbtree.h | 5 +
src/include/nodes/execnodes.h | 4 +
src/include/nodes/pathnodes.h | 5 +
src/include/nodes/plannodes.h | 1 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 25 +++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 7 +
37 files changed, 504 insertions(+), 4 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6458376578..f637635438 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b6f5822b84..395b7de7e8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4284,6 +4284,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 05102724ea..1b12ad9493 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -666,6 +667,15 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan.
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 46f427b312..461e4b00db 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1221,6 +1221,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ While an index scan can be used to retrieve the distinct values of a
+ column, it may be inefficient, since it has to step over every
+ duplicate of each key. In such cases the planner will consider an
+ index skip scan, based on the idea of a
+ <ulink url="https://wiki.postgresql.org/wiki/Loose_indexscan">
+ Loose index scan</ulink>: as soon as a new value is found, the scan
+ searches for a larger value on the same index page, and if none is
+ found there, restarts the search from the root. This is much faster
+ when the index has many equal keys.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 467d91e681..720696f84e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -108,6 +108,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index afc20232ac..36f32f15a4 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b75b3a8dac..11b0a899d3 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -84,6 +84,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f1f01a0956..07d7eeda56 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 4ad30186d9..4f3774128b 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -792,6 +793,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..134eda34ed 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -130,6 +130,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -378,6 +379,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -445,6 +448,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 92832237a8..80330c6fe7 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1192,6 +1192,251 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low, high, compare_offset;
+ Relation indexRel = scan->indexRelation;
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ }
+ else
+ {
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ so->skipScanKey[i].sk_flags = flags;
+ so->skipScanKey[i].sk_argument = datum;
+ }
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos))
+ {
+ buf = so->currPos.buf;
+
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ if (_bt_compare(scan->indexRelation, prefix,
+ so->skipScanKey, page, compare_offset) > compare_value)
+ {
+ bool keyFound = false;
+
+ LockBuffer(buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found scan key within the current page, so let's scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+ * number
+ */
+ stack = _bt_search(scan->indexRelation, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir), &buf, BT_READ,
+ scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ if (ScanDirectionIsForward(dir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+ }
+ else
+ {
+ /* For backward scan finding offnum is more involved. It is wrong to
+ * just use binary search, since we will find the last item from the
+ * sequence of equal items, and we need the first one. Otherwise e.g.
+ * backward cursor scan will return an incorrect value. */
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ /* One step back to find a previous value */
+ if (_bt_next(scan, dir))
+ {
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ so->skipScanKey[i].sk_flags = flags;
+ so->skipScanKey[i].sk_argument = datum;
+ }
+
+ /* And now find the last item from the sequence for the current
+ value, with the intention to do OffsetNumberNext. As a result we
+ end up on the first element of the sequence. */
+ buf = so->currPos.buf;
+
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ if (_bt_compare(scan->indexRelation, prefix,
+ so->skipScanKey, page, compare_offset) > compare_value)
+ {
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+ }
+ else
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir), &buf, BT_READ,
+ scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+ }
+ }
+ else
+ {
+ _bt_freeskey(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+ else
+ {
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ _bt_freeskey(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 8e63c1fad2..a55bd5e9f5 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 1831ea81cf..adf796a64f 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1297,6 +1297,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->distinctPrefix > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->distinctPrefix,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1519,6 +1527,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->distinctPrefix > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index b3f61dd1fc..6640e25163 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -114,6 +114,24 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
+ {
+ /* Reached end of index. At this point currPos is invalidated,
+ * and we need to reset ioss_FirstTupleEmitted, since otherwise
+ * after going backwards, reaching the end of index, and going
+ * forward again we apply skip again. It would be incorrect and
+ * lead to an extra skipped item. */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -249,6 +267,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -505,6 +525,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_NumDistinctKeys = node->distinctPrefix;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 807393dfaa..129ece294a 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -514,6 +514,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(distinctPrefix);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 9d44e3e4c6..74246147ce 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -572,6 +572,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(distinctPrefix);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 43491e297b..b7e5ca7556 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1802,6 +1802,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(distinctPrefix);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b8d406f230..bb194b7e0e 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 1b4f7db649..e335fffe84 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -173,7 +173,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2752,7 +2753,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -5064,7 +5066,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipprefix)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5079,6 +5082,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->distinctPrefix = skipprefix;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b2239728cf..e6118ee35e 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4732,6 +4732,22 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath) &&
+ path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys > 0)
+ {
+ Path *subpath = (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ list_length(root->distinct_pathkeys),
+ numDistinctRows);
+ add_path(distinct_rel, subpath);
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b57de6b4c6..e2384ba896 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2799,6 +2799,45 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(subpath, IndexPath));
+
+ /* We don't want to modify subpath, so make a copy. */
+ memcpy(pathnode, subpath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(numCols > 0);
+ pathnode->indexskipprefix = numCols;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = subpath->startup_cost;
+ pathnode->path.total_cost = subpath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 3efa1bdc1a..e89f22c6bd 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -269,6 +269,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 98d75be292..376141e9d5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -894,6 +894,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a21865a77f..834a775773 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -345,6 +345,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 653ddc976b..082a9bb0d6 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -127,6 +127,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -221,6 +225,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index c4aba39496..a9bf4f58a9 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -170,6 +170,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4fb92d60a1..e74149d1a4 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -470,6 +470,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ ScanKey skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -570,6 +573,7 @@ extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -597,6 +601,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3b789ee7cf..24be304768 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1390,6 +1390,8 @@ typedef struct IndexScanState
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
* VMBuffer buffer in use for visibility map testing, if any
+ * NumDistinctKeys number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ioss_PscanLen Size of parallel index-only scan descriptor
* ----------------
*/
@@ -1408,6 +1410,8 @@ typedef struct IndexOnlyScanState
Relation ioss_RelationDesc;
struct IndexScanDescData *ioss_ScanDesc;
Buffer ioss_VMBuffer;
+ int ioss_NumDistinctKeys;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index d3c477a542..b9e7bb0036 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -815,6 +815,7 @@ typedef struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1165,6 +1166,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1179,6 +1183,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 6d087c268f..632b05a84f 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -431,6 +431,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int distinctPrefix; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..33c7a0a376 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index d0c8f99d0a..96b5262715 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -190,6 +190,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 46deb55c67..c9acae96d7 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..38c9bc4b9b 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,28 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------
+ Index Only Scan using tenk1_four on public.tenk1
+ Output: four
+ Scan mode: Skip scan
+(3 rows)
+
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 59da6b6592..588616446e 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..992e8d7c4d 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,10 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
--
2.16.4
At Thu, 14 Mar 2019 14:32:49 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in <CA+q6zcUSuFBhGVFZN_AVSxRbt5wr_4_YEYwv8PcQB=m6J6Zpvg@mail.gmail.com>
On Tue, Mar 5, 2019 at 4:05 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Although there are still some rough edges, e.g. going forth, back, and forth
again leads to a situation where `_bt_first` is not applied anymore and the
first element is wrongly skipped. I'll try to fix it in the next version of
the patch.

It turns out that `_bt_skip` was unnecessarily applied every time the scan was
restarted from the beginning. Here is the fixed version of the patch.
nbtsearch.c: In function ‘_bt_skip’:
nbtsearch.c:1292:11: error: ‘struct IndexScanDescData’ has no member named ‘xs_ctup’; did you mean ‘xs_itup’?
scan->xs_ctup.t_self = currItem->heapTid;
Unfortunately a recent commit c2fe139c20 hit this.
Date: Mon Mar 11 12:46:41 2019 -0700
Index scans now store the result of a search in
IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
target is not generally a HeapTuple anymore that seems cleaner.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello.
At Thu, 14 Mar 2019 14:32:49 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in <CA+q6zcUSuFBhGVFZN_AVSxRbt5wr_4_YEYwv8PcQB=m6J6Zpvg@mail.gmail.com>
On Tue, Mar 5, 2019 at 4:05 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Although there are still some rough edges, e.g. going forth, back, and forth
again leads to a situation where `_bt_first` is not applied anymore and the
first element is wrongly skipped. I'll try to fix it in the next version of
the patch.

It turns out that `_bt_skip` was unnecessarily applied every time the scan was
restarted from the beginning. Here is the fixed version of the patch.
I have some comments on the latest v11 patch.
L619:
+ indexstate->ioss_NumDistinctKeys = node->distinctPrefix;
The number of distinct prefix keys has various names in this
patch. They should be unified as far as possible.
L:728
+ root->distinct_pathkeys > 0)
It is not an integer, but a list.
L730:
+ Path *subpath = (Path *)
+ create_skipscan_unique_path(root,
The variable named "subpath" here is not actually a subpath; it could also be
removed altogether by calling create_skipscan_unique_path directly in add_path.
L:758
+create_skipscan_unique_path(PlannerInfo *root,
+                            RelOptInfo *rel,
+                            Path *subpath,
+                            int numCols,
The "subpath" is not a subpath. How about basepath or orgpath?
The name "numCols" doesn't convey a clear meaning. How about unique_prefix_keys?
L764:
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(subpath, IndexPath));
+
+ /* We don't want to modify subpath, so make a copy. */
+ memcpy(pathnode, subpath, sizeof(IndexPath));
Why don't you just use copyObject()?
L773:
+ Assert(numCols > 0);
Maybe Assert(numCols > 0 && numCols <= list_length(path->pathkeys)); ?
L586:
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
+ {
+ /* Reached end of index. At this point currPos is invalidated,
I thought a while about this bit. It seems that the lower layer must
know whether it has emitted the first tuple. So I think this
code can be reduced as follows.
if (node->ioss_NumDistinctKeys &&
!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
return ExecClearTuple(slot);
Then index_skip returns true, doing nothing, if the
scandesc is in the initial state. (Of course other index AMs can
do something in the first call.) ioss_FirstTupleEmitted and the
comment can then be removed.
By the way, this patch still seems to overlook the explicit rescan
case, but the change above makes that consideration
unnecessary.
L1032:
+ Index Only Scan using tenk1_four on public.tenk1
+ Output: four
+ Scan mode: Skip scan
The "Scan mode" property has only one value, and it is shown only in the
"Index Only Scan" case. It seems to me that "Index Skip Scan"
implies an index-only scan. How about just "Index Skip Scan"?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Mar 15, 2019 at 4:55 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
I have some comments on the latest v11 patch.
Thank you!
L619:
+ indexstate->ioss_NumDistinctKeys = node->distinctPrefix;
The number of distinct prefix keys has various names in this
patch. They should be unified as far as possible.
Good point; I've renamed everything to skipPrefixSize, which seems
self-explanatory enough.
L:728
+ root->distinct_pathkeys > 0)
It is not an integer, but a list.
Thanks for noticing; fixed (by comparing with NIL, since we just need to know
whether the list is empty).
L730:
+ Path *subpath = (Path *)
+ create_skipscan_unique_path(root,

The name "subpath" here is not a subpath, but it can be removed
by directly calling create_skipscan_unique_path in add_path.

L:758
+create_skipscan_unique_path(PlannerInfo *root,
+                            RelOptInfo *rel,
+                            Path *subpath,
+                            int numCols,

The "subpath" is not a subpath. How about basepath or orgpath?
The name "numCols" doesn't convey a clear meaning. How about unique_prefix_keys?
I agree, suggested names sound good.
L764:
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(subpath, IndexPath));
+
+ /* We don't want to modify subpath, so make a copy. */
+ memcpy(pathnode, subpath, sizeof(IndexPath));

Why don't you just use copyObject()?
Maybe I'm missing something, but I don't see that copyObject works with path
nodes, does it? I tried it with subpath directly and got `unrecognized node
type`.
L773:
+ Assert(numCols > 0);
Maybe Assert(numCols > 0 && numCols <= list_length(path->pathkeys)); ?
Yeah, makes sense.
L586:
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
+ {
+ /* Reached end of index. At this point currPos is invalidated,

I thought a while about this bit. It seems that the lower layer must
know whether it has emitted the first tuple. So I think this
code can be reduced as follows.

if (node->ioss_NumDistinctKeys &&
    !index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
    return ExecClearTuple(slot);

Then index_skip returns true, doing nothing, if the
scandesc is in the initial state. (Of course other index AMs can
do something in the first call.) ioss_FirstTupleEmitted and the
comment can be removed.
I'm not sure then how to figure out, from inside index_skip, when scandesc
is in the initial state without passing the node as an argument. E.g. in
the case described in the commentary, when we do fetch forward / fetch
backward / fetch forward again.
L1032:
+ Index Only Scan using tenk1_four on public.tenk1
+   Output: four
+   Scan mode: Skip scan
The "Scan mode" has only one value and it is shown only for
"Index Only Scan" case. It seems to me that "Index Skip Scan"
implies Index Only Scan. How about just "Index Skip Scan"?
Do you mean to show "Index Only Scan", and then "Index Skip Scan" in the
details instead of "Scan mode"?
Attachments:
v12-0001-Index-skip-scan.patch
From 13eff84d667f452e57bf8a38e516d6edc807ad9f Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Sat, 15 Sep 2018 21:14:50 +0200
Subject: [PATCH v12] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan. To make it suitable both for situations with a
small number of distinct values and for those with a significant number
of distinct values, the following approach is taken: instead of searching
from the root for every value we're looking for, we first search on the
current page, and only if the value is not found there do we continue
searching from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Jesper Pedersen, and a bit adjusted by Dmitry Dolgov.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 16 ++
doc/src/sgml/indexam.sgml | 10 ++
doc/src/sgml/indices.sgml | 24 +++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 ++
src/backend/access/nbtree/nbtree.c | 12 ++
src/backend/access/nbtree/nbtsearch.c | 245 ++++++++++++++++++++++++++
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 12 ++
src/backend/executor/nodeIndexonlyscan.c | 22 +++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/createplan.c | 10 +-
src/backend/optimizer/plan/planner.c | 17 ++
src/backend/optimizer/util/pathnode.c | 40 +++++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 5 +
src/include/access/genam.h | 1 +
src/include/access/nbtree.h | 5 +
src/include/nodes/execnodes.h | 4 +
src/include/nodes/pathnodes.h | 5 +
src/include/nodes/plannodes.h | 1 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 25 +++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 7 +
37 files changed, 506 insertions(+), 4 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6458376578..f637635438 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fe1735722a..0e255ad27f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4326,6 +4326,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 05102724ea..1b12ad9493 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -666,6 +667,15 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan.
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 9943e8ecd4..776e42ca35 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1221,6 +1221,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When an index scan is used to retrieve the distinct values of a column,
+ it can be inefficient, since it has to scan all the equal values of a
+ key. In such cases the planner will consider applying the index skip
+ scan approach, which is based on the idea of a
+ <ulink url="https://wiki.postgresql.org/wiki/Loose_indexscan">
+ Loose index scan</ulink>. Rather than scanning all equal values of a
+ key, as soon as a new value is found, it will search for a larger value
+ on the same index page, and if one is not found, restart the search
+ from the root of the index. This is much faster when the index has many
+ equal keys.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 8f008dd008..639c8d7115 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -108,6 +108,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index afc20232ac..36f32f15a4 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 2ce5425ef9..e44bc353b4 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -84,6 +84,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f1f01a0956..07d7eeda56 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 4ad30186d9..4f3774128b 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -792,6 +793,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..134eda34ed 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -130,6 +130,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -378,6 +379,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -445,6 +448,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 92832237a8..80330c6fe7 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1192,6 +1192,251 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low, high, compare_offset;
+ Relation indexRel = scan->indexRelation;
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ }
+ else
+ {
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ so->skipScanKey[i].sk_flags = flags;
+ so->skipScanKey[i].sk_argument = datum;
+ }
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos))
+ {
+ buf = so->currPos.buf;
+
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ if (_bt_compare(scan->indexRelation, prefix,
+ so->skipScanKey, page, compare_offset) > compare_value)
+ {
+ bool keyFound = false;
+
+ LockBuffer(buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found the scan key within the current page, so let's scan
+ * from the root. Use _bt_search and _bt_binsrch to get the buffer and
+ * offset number.
+ */
+ stack = _bt_search(scan->indexRelation, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir), &buf, BT_READ,
+ scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ if (ScanDirectionIsForward(dir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+ }
+ else
+ {
+ /* For a backward scan, finding offnum is more involved. It is wrong to
+ * just use binary search, since that finds the last item from a
+ * sequence of equal items, while we need the first one. Otherwise e.g.
+ * a backward cursor scan will return an incorrect value. */
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ /* One step back to find a previous value */
+ if (_bt_next(scan, dir))
+ {
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ so->skipScanKey[i].sk_flags = flags;
+ so->skipScanKey[i].sk_argument = datum;
+ }
+
+ /* And now find the last item from the sequence for the current value,
+ * with the intention of doing OffsetNumberNext. As a result we end up
+ * on the first element of the sequence. */
+ buf = so->currPos.buf;
+
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ if (_bt_compare(scan->indexRelation, prefix,
+ so->skipScanKey, page, compare_offset) > compare_value)
+ {
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+ }
+ else
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir), &buf, BT_READ,
+ scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+ }
+ }
+ else
+ {
+ _bt_freeskey(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+ else
+ {
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ _bt_freeskey(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 8e63c1fad2..a55bd5e9f5 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 1831ea81cf..c41995b247 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1297,6 +1297,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->skipPrefixSize,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1519,6 +1527,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->skipPrefixSize > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 26758e7703..dced29df04 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -114,6 +114,24 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_SkipPrefixSize > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_SkipPrefixSize))
+ {
+ /* Reached end of index. At this point currPos is invalidated,
+ * and we need to reset ioss_FirstTupleEmitted, since otherwise,
+ * after going backwards, reaching the end of the index, and going
+ * forward again, we would apply the skip again. That would be
+ * incorrect and lead to an extra skipped item. */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -249,6 +267,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -499,6 +519,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->skipPrefixSize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index a8a735c247..17f71cae25 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -514,6 +514,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(skipPrefixSize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 69179a07c3..6f099d1ed4 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -572,6 +572,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(skipPrefixSize);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 4b845b1bb7..88136560ad 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1804,6 +1804,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(skipPrefixSize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4b9be13f08..91a35e9beb 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 9fbe5b2a5f..987afe7999 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -178,7 +178,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2754,7 +2755,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -5030,7 +5032,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5045,6 +5048,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->skipPrefixSize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 5e3a7120ff..a8c0ba4142 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4691,6 +4691,23 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath) &&
+ path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys != NIL)
+ {
+ int distinctPrefixKeys = list_length(root->distinct_pathkeys);
+
+ add_path(distinct_rel,
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ distinctPrefixKeys,
+ numDistinctRows));
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 169e51e792..140899b556 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2884,6 +2884,46 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *basepath,
+ int distinctPrefixKeys,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(distinctPrefixKeys > 0);
+ Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));
+ pathnode->indexskipprefix = distinctPrefixKeys;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = basepath->startup_cost;
+ pathnode->path.total_cost = basepath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 30f4dc151b..81852da830 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -271,6 +271,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fe6c6f8a05..dbe46f2959 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -905,6 +905,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cccb5f145a..55da1df89c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -347,6 +347,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 653ddc976b..082a9bb0d6 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -127,6 +127,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -221,6 +225,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index c4aba39496..a9bf4f58a9 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -170,6 +170,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 60622ea790..56689cd1fb 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -470,6 +470,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ ScanKey skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -570,6 +573,7 @@ extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -597,6 +601,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index fd13c170d7..2598462e71 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1393,6 +1393,8 @@ typedef struct IndexScanState
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
* VMBuffer buffer in use for visibility map testing, if any
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ioss_PscanLen Size of parallel index-only scan descriptor
* ----------------
*/
@@ -1411,6 +1413,8 @@ typedef struct IndexOnlyScanState
Relation ioss_RelationDesc;
struct IndexScanDescData *ioss_ScanDesc;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 253e0b7e48..653f3b174b 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -819,6 +819,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1155,6 +1156,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1167,6 +1171,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d66a187a53..dbd4ca6216 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -431,6 +431,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int skipPrefixSize; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..33c7a0a376 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 574bb85b50..2c480bc9c8 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -202,6 +202,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 5d4eb59a0c..f982ddbfb6 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..38c9bc4b9b 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,28 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------
+ Index Only Scan using tenk1_four on public.tenk1
+ Output: four
+ Scan mode: Skip scan
+(3 rows)
+
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 67ecad8dd5..f6e95eae57 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..992e8d7c4d 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,10 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
--
2.16.4
On Sat, Mar 16, 2019 at 5:14 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
On Fri, Mar 15, 2019 at 4:55 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
I have some comments on the latest v11 patch.

Thank you!
In the meantime here is a new version, rebased after tableam changes.
Attachments:
v13-0001-Index-skip-scan.patch (application/octet-stream)
From adb872f2094d19a06ff35d30bcd9a69747e6c6a1 Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Sat, 15 Sep 2018 21:14:50 +0200
Subject: [PATCH v13] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan. To make it suitable both when there are few
distinct values and when there are many, the following approach is
taken: instead of searching from the root for every value, we first
search on the current page, and only if the value is not found there do
we continue the search from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Jesper Pedersen, and a bit adjusted by Dmitry Dolgov.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 16 ++
doc/src/sgml/indexam.sgml | 10 ++
doc/src/sgml/indices.sgml | 24 +++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 ++
src/backend/access/nbtree/nbtree.c | 12 ++
src/backend/access/nbtree/nbtsearch.c | 245 ++++++++++++++++++++++++++
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 12 ++
src/backend/executor/nodeIndexonlyscan.c | 22 +++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/createplan.c | 10 +-
src/backend/optimizer/plan/planner.c | 17 ++
src/backend/optimizer/util/pathnode.c | 40 +++++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 5 +
src/include/access/genam.h | 1 +
src/include/access/nbtree.h | 5 +
src/include/nodes/execnodes.h | 4 +
src/include/nodes/pathnodes.h | 5 +
src/include/nodes/plannodes.h | 1 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 25 +++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 7 +
37 files changed, 506 insertions(+), 4 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6458376578..f637635438 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d383de2512..fa8e60b121 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4340,6 +4340,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of the index skip scan
+ plan type (see <xref linkend="indexes-index-skip-scans"/>). This
+ parameter requires that <varname>enable_indexonlyscan</varname> is
+ <literal>on</literal>. The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 05102724ea..1b12ad9493 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -666,6 +667,15 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ Skip past all tuples where the first <literal>prefix</literal> columns
+ have the same value as the last tuple returned in the current scan.
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 9943e8ecd4..776e42ca35 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1221,6 +1221,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When an index scan is used to retrieve the distinct values of a column,
+ it can be inefficient, since it has to scan all the equal values of a
+ key. In such cases the planner will consider applying the index skip
+ scan approach, which is based on the idea of a
+ <ulink url="https://wiki.postgresql.org/wiki/Loose_indexscan">
+ Loose index scan</ulink>. Rather than scanning all equal values of a
+ key, as soon as a new value is found, it searches for a larger value on
+ the same index page, and if none is found there, restarts the search
+ from the root. This is much faster when the index has many equal keys.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 8f008dd008..639c8d7115 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -108,6 +108,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index afc20232ac..36f32f15a4 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 2ce5425ef9..e44bc353b4 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -84,6 +84,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f1f01a0956..07d7eeda56 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index ae1c87ebad..9f7ebc633d 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -756,6 +757,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 60e0b90ccf..769b830699 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -130,6 +130,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -378,6 +379,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -445,6 +448,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index af3da3aa5b..a6a3b06e6d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1192,6 +1192,251 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low, high, compare_offset;
+ Relation indexRel = scan->indexRelation;
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ }
+ else
+ {
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ so->skipScanKey[i].sk_flags = flags;
+ so->skipScanKey[i].sk_argument = datum;
+ }
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos))
+ {
+ buf = so->currPos.buf;
+
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ if (_bt_compare(scan->indexRelation, prefix,
+ so->skipScanKey, page, compare_offset) > compare_value)
+ {
+ bool keyFound = false;
+
+ LockBuffer(buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found the scan key within the current page, so let's scan
+ * from the root. Use _bt_search and _bt_binsrch to get the buffer and
+ * offset number.
+ */
+ stack = _bt_search(scan->indexRelation, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir), &buf, BT_READ,
+ scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ if (ScanDirectionIsForward(dir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+ }
+ else
+ {
+ /*
+ * For a backward scan, finding offnum is more involved. It is wrong
+ * to just use binary search, since that finds the last item from the
+ * sequence of equal items, while we need the first one; otherwise
+ * e.g. a backward cursor scan would return an incorrect value.
+ */
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ /* One step back to find a previous value */
+ if (_bt_next(scan, dir))
+ {
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ so->skipScanKey[i].sk_flags = flags;
+ so->skipScanKey[i].sk_argument = datum;
+ }
+
+ /*
+ * And now find the last item from the sequence of equal values for
+ * the current value, with the intention to do OffsetNumberNext on it.
+ * As a result we end up on the first element of the sequence.
+ */
+ buf = so->currPos.buf;
+
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ if (_bt_compare(scan->indexRelation, prefix,
+ so->skipScanKey, page, compare_offset) > compare_value)
+ {
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+ }
+ else
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir), &buf, BT_READ,
+ scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, buf, prefix, so->skipScanKey,
+ ScanDirectionIsForward(dir));
+ }
+ }
+ else
+ {
+ _bt_freeskey(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+ else
+ {
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ _bt_freeskey(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 8e63c1fad2..a55bd5e9f5 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 1831ea81cf..c41995b247 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1297,6 +1297,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->skipPrefixSize,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1519,6 +1527,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->skipPrefixSize > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 2d954b722a..3adfce24f5 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -115,6 +115,24 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_SkipPrefixSize > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_SkipPrefixSize))
+ {
+ /*
+ * Reached the end of the index. At this point currPos is
+ * invalidated, and we need to reset ioss_FirstTupleEmitted,
+ * since otherwise, after going backwards, reaching the end of
+ * the index, and going forward again, we would apply the skip
+ * again. That would be incorrect and lead to an extra skipped
+ * item.
+ */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -253,6 +271,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -503,6 +523,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->skipPrefixSize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c68bd7bcf7..5e2af32b73 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -514,6 +514,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(skipPrefixSize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 69179a07c3..6f099d1ed4 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -572,6 +572,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(skipPrefixSize);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 4b845b1bb7..88136560ad 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1804,6 +1804,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(skipPrefixSize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4b9be13f08..91a35e9beb 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 9fbe5b2a5f..987afe7999 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -178,7 +178,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2754,7 +2755,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -5030,7 +5032,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5045,6 +5048,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->skipPrefixSize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e408e77d6f..3d6a8f78b6 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4682,6 +4682,23 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath) &&
+ path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys != NIL)
+ {
+ int distinctPrefixKeys = list_length(root->distinct_pathkeys);
+
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ distinctPrefixKeys,
+ numDistinctRows));
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 169e51e792..140899b556 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2884,6 +2884,46 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *basepath,
+ int distinctPrefixKeys,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(distinctPrefixKeys > 0);
+ Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));
+ pathnode->indexskipprefix = distinctPrefixKeys;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = basepath->startup_cost;
+ pathnode->path.total_cost = basepath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 30f4dc151b..81852da830 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -271,6 +271,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index aa564d153a..6ffdac16fa 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -905,6 +905,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cccb5f145a..55da1df89c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -347,6 +347,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 653ddc976b..082a9bb0d6 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -127,6 +127,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -221,6 +225,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index cad66513f6..a694e9a19b 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -172,6 +172,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 60622ea790..56689cd1fb 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -470,6 +470,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ ScanKey skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -570,6 +573,7 @@ extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -597,6 +601,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 62eb1a06ee..f1df7a4369 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1393,6 +1393,8 @@ typedef struct IndexScanState
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
* VMBuffer buffer in use for visibility map testing, if any
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ioss_PscanLen Size of parallel index-only scan descriptor
* ----------------
*/
@@ -1411,6 +1413,8 @@ typedef struct IndexOnlyScanState
Relation ioss_RelationDesc;
struct IndexScanDescData *ioss_ScanDesc;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 253e0b7e48..653f3b174b 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -819,6 +819,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1155,6 +1156,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1167,6 +1171,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d66a187a53..dbd4ca6216 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -431,6 +431,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int skipPrefixSize; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..33c7a0a376 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 574bb85b50..2c480bc9c8 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -202,6 +202,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 5d4eb59a0c..f982ddbfb6 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..38c9bc4b9b 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,28 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------
+ Index Only Scan using tenk1_four on public.tenk1
+ Output: four
+ Scan mode: Skip scan
+(3 rows)
+
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 67ecad8dd5..f6e95eae57 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..992e8d7c4d 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,10 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
--
2.16.4
On Tue, Mar 19, 2019 at 2:07 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
On Sat, Mar 16, 2019 at 5:14 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
On Fri, Mar 15, 2019 at 4:55 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
I have some comments on the latest v11 patch.
Thank you!
In the meantime here is a new version, rebased after tableam changes.
Rebase after refactoring of nbtree insertion scankeys. But so far it's purely
mechanical, just to make it work - I guess I'll need to try to rewrite some
parts of the patch, that don't look natural now, accordingly. And maybe to
leverage dynamic prefix truncation per Peter suggestion.
Attachments:
v14-0001-Index-skip-scan.patch (application/octet-stream)
From 4376bda3887b3e9b5918acac19890413d4414adc Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Sat, 15 Sep 2018 21:14:50 +0200
Subject: [PATCH v14] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan. To make it suitable both for situations with a
small number of distinct values and for those with a significant number
of distinct values, the following approach is taken: instead of
descending from the root for every value we're searching for, we first
search on the current page, and only if the value is not found there do
we continue the search from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Jesper Pedersen, and a bit adjusted by Dmitry Dolgov.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
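
As an illustration of the approach, here is a toy Python sketch (an
assumed page size and a plain sorted list stand in for the btree; this
is not the patch's C implementation):

```python
# Toy model of the loose index scan idea: enumerate distinct values of a
# sorted "index" by first looking for the next larger value within the
# current "page", and only falling back to a search "from the root"
# (here: a bisect over the whole list) when the page is exhausted.
from bisect import bisect_right

PAGE_SIZE = 4  # assumed page size, for illustration only


def distinct_skip_scan(index):
    """Yield the distinct values of a sorted list, skipping duplicates."""
    pos = 0
    while pos < len(index):
        value = index[pos]
        yield value
        # First try to find a larger value on the current "page".
        page_end = min(pos - pos % PAGE_SIZE + PAGE_SIZE, len(index))
        nxt = bisect_right(index, value, pos, page_end)
        if nxt == page_end:
            # Not found on this page: restart the search "from the root".
            nxt = bisect_right(index, value)
        pos = nxt


print(list(distinct_skip_scan([1, 1, 1, 1, 1, 2, 2, 3])))  # [1, 2, 3]
```

When duplicates are long runs (few distinct values) the root restarts
dominate and each one skips many tuples; when values are mostly unique
the in-page search succeeds almost every time, which is why the hybrid
avoids regressing in either case.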
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 16 ++
doc/src/sgml/indexam.sgml | 10 ++
doc/src/sgml/indices.sgml | 24 +++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 ++
src/backend/access/nbtree/nbtree.c | 12 ++
src/backend/access/nbtree/nbtsearch.c | 249 ++++++++++++++++++++++++++
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 12 ++
src/backend/executor/nodeIndexonlyscan.c | 22 +++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/createplan.c | 10 +-
src/backend/optimizer/plan/planner.c | 17 ++
src/backend/optimizer/util/pathnode.c | 40 +++++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 5 +
src/include/access/genam.h | 1 +
src/include/access/nbtree.h | 5 +
src/include/nodes/execnodes.h | 4 +
src/include/nodes/pathnodes.h | 5 +
src/include/nodes/plannodes.h | 1 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 25 +++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 7 +
37 files changed, 510 insertions(+), 4 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index d078dfbd46..c6ee2f5403 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d383de2512..fa8e60b121 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4340,6 +4340,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 05102724ea..1b12ad9493 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -666,6 +667,15 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan.
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 3493f482b8..729990ae26 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1233,6 +1233,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When an index scan is used to retrieve the distinct values of a column,
+ it can be rather inefficient, since it has to scan all the equal values
+ of a key. In such cases the planner will consider applying the index
+ skip scan approach, which is based on the idea of a
+ <ulink url="https://wiki.postgresql.org/wiki/Loose_indexscan">
+ Loose index scan</ulink>. Rather than scanning all equal values of a
+ key, as soon as a new value is found, it will search for a larger value
+ on the same index page, and if not found there, restart the search from
+ the root. This is much faster when the index has many equal keys.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 8f008dd008..639c8d7115 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -108,6 +108,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index afc20232ac..36f32f15a4 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index a746e911f3..5c938fa9fb 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -84,6 +84,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f1f01a0956..07d7eeda56 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index ae1c87ebad..9f7ebc633d 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -756,6 +757,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index ac6f1eb342..ee14f4cb2e 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -130,6 +130,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -378,6 +379,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -445,6 +448,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7ed4e01bd3..bd90330654 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1378,6 +1378,255 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low, high, compare_offset;
+ Relation indexRel = scan->indexRelation;
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ so->skipScanKey->keysz = prefix;
+ so->skipScanKey->scantid = NULL;
+ }
+ else
+ {
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+ ScanKey scankeys = so->skipScanKey->scankeys;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos))
+ {
+ buf = so->currPos.buf;
+
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ if (_bt_compare(scan->indexRelation,
+ so->skipScanKey, page, compare_offset) > compare_value)
+ {
+ bool keyFound = false;
+
+ LockBuffer(buf, BT_READ);
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found the scan key within the current page, so let's scan
+ * from the root.  Use _bt_search and _bt_binsrch to get the buffer and
+ * offset number.
+ */
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ if (ScanDirectionIsForward(dir))
+ {
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ else
+ {
+ /*
+ * For a backward scan, finding offnum is more involved.  It is wrong to
+ * just use binary search, since we would find the last item from the
+ * sequence of equal items, while we need the first one.  Otherwise e.g. a
+ * backward cursor scan would return an incorrect value.
+ */
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int i;
+
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ /* One step back to find a previous value */
+ if (_bt_next(scan, dir))
+ {
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+ ScanKey scankeys = so->skipScanKey->scankeys;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+
+ /*
+ * And now find the last item from the sequence for the current value,
+ * with the intention of doing OffsetNumberNext.  As a result we end up
+ * on the first element of the sequence.
+ */
+ buf = so->currPos.buf;
+
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ if (_bt_compare(scan->indexRelation, so->skipScanKey,
+ page, compare_offset) > compare_value)
+ {
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ else
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ }
+ else
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+ else
+ {
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 8e63c1fad2..a55bd5e9f5 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 1831ea81cf..c41995b247 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1297,6 +1297,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->skipPrefixSize,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1519,6 +1527,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->skipPrefixSize > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 2d954b722a..3adfce24f5 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -115,6 +115,24 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_SkipPrefixSize > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_SkipPrefixSize))
+ {
+ /*
+ * Reached the end of the index. At this point currPos is invalidated,
+ * and we need to reset ioss_FirstTupleEmitted, since otherwise, after
+ * going backwards, reaching the end of the index, and going forwards
+ * again, we would apply the skip again. That would be incorrect and
+ * would lead to an extra skipped item.
+ */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -253,6 +271,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -503,6 +523,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->skipPrefixSize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 04cc15606d..41a006b5d2 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -515,6 +515,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(skipPrefixSize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 910a738c20..59923159ef 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -573,6 +573,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(skipPrefixSize);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index eff98febf1..90f9759a65 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1805,6 +1805,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(skipPrefixSize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4b9be13f08..91a35e9beb 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 979c3c212f..ccecbfdd13 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -178,7 +178,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2773,7 +2774,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -5049,7 +5051,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5064,6 +5067,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->skipPrefixSize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e408e77d6f..3d6a8f78b6 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4682,6 +4682,23 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath) &&
+ path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys != NIL)
+ {
+ int distinctPrefixKeys = list_length(root->distinct_pathkeys);
+
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ distinctPrefixKeys,
+ numDistinctRows));
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 56de8fc370..42363f5d54 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2912,6 +2912,46 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode that is the same as an existing IndexPath except
+ * that it skips duplicate values. This may or may not be cheaper than
+ * using create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *basepath,
+ int distinctPrefixKeys,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(distinctPrefixKeys > 0);
+ Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));
+ pathnode->indexskipprefix = distinctPrefixKeys;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = basepath->startup_cost;
+ pathnode->path.total_cost = basepath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 702c4f89b8..aec39e9d19 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -270,6 +270,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index aa564d153a..6ffdac16fa 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -905,6 +905,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cccb5f145a..55da1df89c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -347,6 +347,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 653ddc976b..082a9bb0d6 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -127,6 +127,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -221,6 +225,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index cad66513f6..a694e9a19b 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -172,6 +172,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 473c6f2918..4fc0237389 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -654,6 +654,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTScanInsert skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -758,6 +761,7 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -782,6 +786,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 869c303e15..a3505dd6a7 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1395,6 +1395,8 @@ typedef struct IndexScanState
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
* VMBuffer buffer in use for visibility map testing, if any
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ioss_PscanLen Size of parallel index-only scan descriptor
* ----------------
*/
@@ -1413,6 +1415,8 @@ typedef struct IndexOnlyScanState
Relation ioss_RelationDesc;
struct IndexScanDescData *ioss_ScanDesc;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 253e0b7e48..653f3b174b 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -819,6 +819,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1155,6 +1156,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1167,6 +1171,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 24740c31e3..148b8672f7 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -432,6 +432,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int skipPrefixSize; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..33c7a0a376 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 574bb85b50..2c480bc9c8 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -202,6 +202,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index cc3dda4c70..9d8d55eea2 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..38c9bc4b9b 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,28 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------
+ Index Only Scan using tenk1_four on public.tenk1
+ Output: four
+ Scan mode: Skip scan
+(3 rows)
+
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 15c0f1f5d1..296b07c1d4 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..992e8d7c4d 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,10 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
--
2.16.4
On Thu, Mar 28, 2019 at 11:01 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Rebase after refactoring of nbtree insertion scankeys. But so far it's purely
mechanical, just to make it work - I guess I'll need to try to rewrite some
parts of the patch, that don't look natural now, accordingly.
Here is the updated version with the changes I was talking about (mostly about
readability and code cleanup). I've also added a few tests for cursor behaviour.
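For completeness, the intended interaction with the new GUC, applied to the example from the top of the thread, looks like this (plan shape as in the patch's regression output; exact costs omitted, and the fallback plan is only what the planner chose in my runs):

```sql
-- Assumes the t1 / idx_t1_b setup from the start of this thread.
SET enable_indexskipscan = on;   -- the GUC added by this patch (default: on)
EXPLAIN (VERBOSE, COSTS OFF) SELECT DISTINCT b FROM t1;
--  Index Only Scan using idx_t1_b on public.t1
--    Output: b
--    Scan mode: Skip scan

SET enable_indexskipscan = off;  -- falls back to an ordinary plan,
                                 -- e.g. Seq Scan feeding a HashAggregate
```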
Attachments:
v15-0001-Index-skip-scan.patch (application/octet-stream)
From ed1f7ba621721acbc4dd01060e4c4e07c7fcc709 Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Sat, 15 Sep 2018 21:14:50 +0200
Subject: [PATCH v15] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan. To make it suitable both for situations with a
small number of distinct values and for those with a significant number
of them, the following approach is taken: instead of searching from the
root for every value, we first search on the current page, and only if
the value is not found there do we continue the search from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Jesper Pedersen, and a bit adjusted by Dmitry Dolgov.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 16 ++
doc/src/sgml/indexam.sgml | 10 ++
doc/src/sgml/indices.sgml | 24 +++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 ++
src/backend/access/nbtree/nbtree.c | 12 ++
src/backend/access/nbtree/nbtsearch.c | 224 +++++++++++++++++++++++++-
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 12 ++
src/backend/executor/nodeIndexonlyscan.c | 22 +++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/createplan.c | 10 +-
src/backend/optimizer/plan/planner.c | 17 ++
src/backend/optimizer/util/pathnode.c | 40 +++++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 5 +
src/include/access/genam.h | 1 +
src/include/access/nbtree.h | 5 +
src/include/nodes/execnodes.h | 4 +
src/include/nodes/pathnodes.h | 5 +
src/include/nodes/plannodes.h | 1 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 75 +++++++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 21 +++
37 files changed, 548 insertions(+), 5 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index d078dfbd46..c6ee2f5403 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d383de2512..fa8e60b121 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4340,6 +4340,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 05102724ea..1b12ad9493 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -666,6 +667,15 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ Skip past all tuples where the first <literal>prefix</literal> columns have
+ the same values as in the last tuple returned in the current scan.
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 3493f482b8..729990ae26 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1233,6 +1233,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When an index scan is used to retrieve the distinct values of a column,
+ it can be rather inefficient, since it has to scan all the equal values
+ of a key. In such cases the planner will consider applying the index
+ skip scan approach, which is based on the idea of a
+ <ulink url="https://wiki.postgresql.org/wiki/Loose_indexscan">
+ Loose index scan</ulink>. Rather than scanning all equal values of a
+ key, as soon as a new value is found, it searches for a larger value on
+ the same index page, and if none is found there, restarts the search
+ from the root of the index. This is much faster when the index has many
+ equal keys.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 8f008dd008..639c8d7115 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -108,6 +108,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index afc20232ac..36f32f15a4 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index a746e911f3..5c938fa9fb 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -84,6 +84,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f1f01a0956..07d7eeda56 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index ae1c87ebad..9f7ebc633d 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -756,6 +757,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index ac6f1eb342..ee14f4cb2e 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -130,6 +130,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -378,6 +379,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -445,6 +448,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7ed4e01bd3..f9ba919ef6 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -37,7 +37,10 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
+static inline void _bt_update_skip_scankeys(IndexScanDesc scan,
+ Relation indexRel);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir);
/*
* _bt_drop_lock_and_maybe_pin()
@@ -1378,6 +1381,184 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Relation indexRel = scan->indexRelation;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ so->skipScanKey->keysz = prefix;
+ so->skipScanKey->scantid = NULL;
+ }
+ else
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos) &&
+ _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {
+ bool keyFound = false;
+
+ buf = so->currPos.buf;
+ LockBuffer(buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found the scan key within the current page, so let's scan
+ * from the root. Use _bt_search and _bt_binsrch to get the buffer and
+ * offset number.
+ */
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ if (ScanDirectionIsForward(dir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ else
+ {
+ /*
+ * For a backward scan finding offnum is more involved. It is wrong to
+ * just use binary search, since that finds the last item from the
+ * sequence of equal items, while we need the first one. Otherwise e.g.
+ * a backward cursor scan would return an incorrect value.
+ */
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ /* One step back to find a previous value */
+ if (_bt_next(scan, dir))
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ /*
+ * And now find the last item from the sequence for the current value,
+ * with the intention to do OffsetNumberNext. As a result we end up on
+ * the first element of the sequence.
+ */
+ if (_bt_scankey_within_page(scan, so->skipScanKey,
+ so->currPos.buf, dir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ else
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ }
+ else
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ else
+ offnum = OffsetNumberNext(offnum);
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
@@ -2247,3 +2428,44 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
so->numKilled = 0; /* just paranoia */
so->markItemIndex = -1; /* ditto */
}
+
+static inline void
+_bt_update_skip_scankeys(IndexScanDesc scan, Relation indexRel)
+{
+ TupleDesc itupdesc;
+ int indnkeyatts, i;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey scankeys = so->skipScanKey->scankeys;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+}
+
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir)
+{
+ OffsetNumber low, high, compare_offset;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ return _bt_compare(scan->indexRelation,
+ key, page, compare_offset) > compare_value;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 8e63c1fad2..a55bd5e9f5 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 1831ea81cf..c41995b247 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1297,6 +1297,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->skipPrefixSize,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1519,6 +1527,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->skipPrefixSize > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 2d954b722a..3adfce24f5 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -115,6 +115,24 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_SkipPrefixSize > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_SkipPrefixSize))
+ {
+ /*
+ * Reached the end of the index. At this point currPos is invalidated,
+ * and we need to reset ioss_FirstTupleEmitted, since otherwise, after
+ * going backwards, reaching the end of the index, and going forwards
+ * again, we would apply the skip once more, incorrectly skipping an
+ * extra item.
+ */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -253,6 +271,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -503,6 +523,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->skipPrefixSize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 04cc15606d..41a006b5d2 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -515,6 +515,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(skipPrefixSize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 910a738c20..59923159ef 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -573,6 +573,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(skipPrefixSize);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index eff98febf1..90f9759a65 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1805,6 +1805,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(skipPrefixSize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4b9be13f08..91a35e9beb 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 979c3c212f..ccecbfdd13 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -178,7 +178,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2773,7 +2774,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -5049,7 +5051,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5064,6 +5067,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->skipPrefixSize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e408e77d6f..3d6a8f78b6 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4682,6 +4682,23 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath) &&
+ path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys != NIL)
+ {
+ int distinctPrefixKeys = list_length(root->distinct_pathkeys);
+
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ distinctPrefixKeys,
+ numDistinctRows));
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 56de8fc370..42363f5d54 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2912,6 +2912,46 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *basepath,
+ int distinctPrefixKeys,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(distinctPrefixKeys > 0);
+ Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));
+ pathnode->indexskipprefix = distinctPrefixKeys;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = basepath->startup_cost;
+ pathnode->path.total_cost = basepath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 702c4f89b8..aec39e9d19 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -270,6 +270,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index aa564d153a..6ffdac16fa 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -905,6 +905,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cccb5f145a..55da1df89c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -347,6 +347,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 653ddc976b..082a9bb0d6 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -127,6 +127,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -221,6 +225,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index cad66513f6..a694e9a19b 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -172,6 +172,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 473c6f2918..4fc0237389 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -654,6 +654,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTScanInsert skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -758,6 +761,7 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -782,6 +786,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 869c303e15..a3505dd6a7 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1395,6 +1395,8 @@ typedef struct IndexScanState
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
* VMBuffer buffer in use for visibility map testing, if any
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ioss_PscanLen Size of parallel index-only scan descriptor
* ----------------
*/
@@ -1413,6 +1415,8 @@ typedef struct IndexOnlyScanState
Relation ioss_RelationDesc;
struct IndexScanDescData *ioss_ScanDesc;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 253e0b7e48..653f3b174b 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -819,6 +819,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1155,6 +1156,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1167,6 +1171,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 24740c31e3..148b8672f7 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -432,6 +432,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int skipPrefixSize; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..33c7a0a376 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 574bb85b50..2c480bc9c8 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -202,6 +202,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index cc3dda4c70..9d8d55eea2 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..6ae000920c 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,78 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------
+ Index Only Scan using tenk1_four on public.tenk1
+ Output: four
+ Scan mode: Skip scan
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+FETCH FROM c;
+ four
+------
+ 0
+(1 row)
+
+FETCH BACKWARD FROM c;
+ four
+------
+(0 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+END;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 15c0f1f5d1..296b07c1d4 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..3eeb4079db 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,24 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+END;
--
2.16.4
On Sat, May 11, 2019 at 6:35 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Here is the updated version with the changes I was talking about (mostly
about readability and code cleanup). I've also added a few tests for
cursor behaviour.
And one more cosmetic rebase after pg_indent.
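For anyone skimming the thread, the core trick can be sketched in plain C over a sorted array (purely illustrative; the patch of course operates on B-tree pages, not in-memory arrays): instead of stepping through every duplicate, binary-search for the first key greater than the one just returned.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the index of the first element strictly greater than 'current',
 * or nkeys if there is none. This mirrors the "skip" step: jump over the
 * whole run of duplicates instead of walking it item by item.
 */
static size_t
skip_to_next_distinct(const int *keys, size_t nkeys, int current)
{
	size_t		lo = 0,
				hi = nkeys;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (keys[mid] <= current)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Collect the distinct keys of a sorted array by repeatedly skipping,
 * the way SELECT DISTINCT b is satisfied by an index skip scan.
 */
static size_t
distinct_keys(const int *keys, size_t nkeys, int *out)
{
	size_t		ndistinct = 0;
	size_t		pos = 0;

	while (pos < nkeys)
	{
		out[ndistinct++] = keys[pos];
		pos = skip_to_next_distinct(keys, nkeys, keys[pos]);
	}
	return ndistinct;
}
```

With ten million rows but only three distinct values, this does on the order of ndistinct * log(n) comparisons instead of n, which is the same asymptotic win the EXPLAIN output in the opening mail shows.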
Attachments:
v16-0001-Index-skip-scan.patch (application/octet-stream)
From 33cb0fba60643ba67174044c43ef11c94fb51891 Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Sat, 15 Sep 2018 21:14:50 +0200
Subject: [PATCH v16] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan. To make it suitable both for situations with a
small number of distinct values and for those with a significant number
of distinct values, the following approach is taken: instead of
searching from the root for every value, we first search on the current
page, and if the value is not found there, continue searching from the
root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Jesper Pedersen, and further adjusted by Dmitry Dolgov.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
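The "first search on the current page" step can be sketched as a boundary comparison, a deliberate simplification of what _bt_scankey_within_page checks in the patch (arrays stand in for index pages here, and the nextkey/offset details of the real code are glossed over):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Can the next target key still be located on the current "page"?
 * For a forward scan the page can only contain it if the page's last
 * (largest) key is not smaller than the target; for a backward scan,
 * if the first (smallest) key is not larger. Only when this test fails
 * do we pay for a fresh descent from the root.
 */
static bool
key_within_page(const int *page, size_t nitems, int target, bool forward)
{
	if (nitems == 0)
		return false;
	return forward ? (target <= page[nitems - 1])
		: (target >= page[0]);
}
```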
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 16 ++
doc/src/sgml/indexam.sgml | 10 ++
doc/src/sgml/indices.sgml | 24 +++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 ++
src/backend/access/nbtree/nbtree.c | 12 ++
src/backend/access/nbtree/nbtsearch.c | 224 +++++++++++++++++++++++++-
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 12 ++
src/backend/executor/nodeIndexonlyscan.c | 22 +++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/createplan.c | 10 +-
src/backend/optimizer/plan/planner.c | 17 ++
src/backend/optimizer/util/pathnode.c | 40 +++++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 5 +
src/include/access/genam.h | 1 +
src/include/access/nbtree.h | 5 +
src/include/nodes/execnodes.h | 4 +
src/include/nodes/pathnodes.h | 5 +
src/include/nodes/plannodes.h | 1 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 75 +++++++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 21 +++
37 files changed, 548 insertions(+), 5 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index ee3bd56274..a88b730f2e 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..a1c8a1ea27 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4400,6 +4400,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index dd54c68802..c2eb296306 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -144,6 +144,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -687,6 +688,15 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ Skip past all tuples where the first <literal>prefix</literal> columns have
+ the same value as the last tuple returned in the current scan.
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 95c0a1926c..592149f10e 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1235,6 +1235,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When an index scan is used to retrieve the distinct values of a column,
+ it can be inefficient, because it has to step over all the duplicates of
+ each key. In such cases the planner will consider applying the index
+ skip scan approach, which is based on the idea of a
+ <ulink url="https://wiki.postgresql.org/wiki/Loose_indexscan">
+ Loose index scan</ulink>. Rather than scanning all equal values of a
+ key, as soon as a new value is found, it searches for a larger value on
+ the same index page, and if none is found, restarts the search from the
+ root of the tree. This is much faster when the index has many equal keys.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ae7b729edd..233ea9e5ec 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -109,6 +109,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cf9699ad18..9817f34c34 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 45c00aaa87..b9b707e9be 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e9f2c84af1..b7e9a1e949 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -82,6 +82,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index aefdd2916d..1c2def162c 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -730,6 +731,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 85e54ac44b..3e50abd6b0 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -131,6 +131,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -380,6 +381,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -447,6 +450,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 1f809c24a1..44d2a3caf8 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -37,7 +37,10 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
+static inline void _bt_update_skip_scankeys(IndexScanDesc scan,
+ Relation indexRel);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir);
/*
* _bt_drop_lock_and_maybe_pin()
@@ -1380,6 +1383,184 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Relation indexRel = scan->indexRelation;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ so->skipScanKey->keysz = prefix;
+ so->skipScanKey->scantid = NULL;
+ }
+ else
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos) &&
+ _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {
+ bool keyFound = false;
+
+ buf = so->currPos.buf;
+ LockBuffer(buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found the scan key within the current page, so let's scan
+ * from the root. Use _bt_search and _bt_binsrch to get the buffer and
+ * offset number.
+ */
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ if (ScanDirectionIsForward(dir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ else
+ {
+ /*
+ * For a backward scan, finding offnum is more involved. It is wrong to
+ * just use binary search, since we would find the last item from the
+ * sequence of equal items, while we need the first one. Otherwise e.g.
+ * a backward cursor scan would return an incorrect value.
+ */
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ /* One step back to find a previous value */
+ if (_bt_next(scan, dir))
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ /*
+ * And now find the last item from the sequence for the current
+ * value, with the intention of doing OffsetNumberNext. As a
+ * result we end up on the first element of the sequence.
+ */
+ if (_bt_scankey_within_page(scan, so->skipScanKey,
+ so->currPos.buf, dir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ else
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ }
+ else
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ else
+ offnum = OffsetNumberNext(offnum);
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
@@ -2249,3 +2430,44 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
so->numKilled = 0; /* just paranoia */
so->markItemIndex = -1; /* ditto */
}
+
+static inline void
+_bt_update_skip_scankeys(IndexScanDesc scan, Relation indexRel)
+{
+ TupleDesc itupdesc;
+ int indnkeyatts, i;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey scankeys = so->skipScanKey->scankeys;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+}
+
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir)
+{
+ OffsetNumber low, high, compare_offset;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ return _bt_compare(scan->indexRelation,
+ key, page, compare_offset) > compare_value;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 45472db147..dc151ecf09 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 92969636b7..c750437cc6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1373,6 +1373,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->skipPrefixSize,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1595,6 +1603,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->skipPrefixSize > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index ee5b1c493b..e833b40841 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -115,6 +115,24 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_SkipPrefixSize > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_SkipPrefixSize))
+ {
+ /*
+ * Reached the end of the index. At this point currPos is
+ * invalidated, and we need to reset ioss_FirstTupleEmitted,
+ * since otherwise after going backwards, reaching the end of
+ * the index, and going forward again we would apply the skip
+ * again. That would be incorrect and lead to an extra skipped
+ * item.
+ */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -253,6 +271,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -503,6 +523,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->skipPrefixSize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 78deade89b..421719cbb7 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -515,6 +515,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(skipPrefixSize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 237598e110..4ec900606f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -573,6 +573,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(skipPrefixSize);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 6c2626ee62..5e431bec7b 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1806,6 +1806,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(skipPrefixSize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a2a9b1f7be..6e0fe90e5c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 608d5adfed..b578dcca20 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -180,7 +180,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2903,7 +2904,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -5179,7 +5181,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5194,6 +5197,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->skipPrefixSize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index cb897cc7f4..3c63bee69a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4807,6 +4807,23 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath) &&
+ path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys != NIL)
+ {
+ int distinctPrefixKeys = list_length(root->distinct_pathkeys);
+
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ distinctPrefixKeys,
+ numDistinctRows));
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d884d2bb00..46520d5334 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2928,6 +2928,46 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *basepath,
+ int distinctPrefixKeys,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(distinctPrefixKeys > 0);
+ Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));
+ pathnode->indexskipprefix = distinctPrefixKeys;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = basepath->startup_cost;
+ pathnode->path.total_cost = basepath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 2405acbf6f..0f0bdad2ac 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -269,6 +269,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1208eb9a68..007c8ac14e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -912,6 +912,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5ee5e09ddf..99facc8f50 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6e3db06eed..34033c5486 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -130,6 +130,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -225,6 +229,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 8c053be2ca..2e79098b85 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -173,6 +173,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a3583f225b..247cdb8127 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -663,6 +663,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTScanInsert skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -777,6 +780,7 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -801,6 +805,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 64122bc1e3..5c4eb48bd6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1404,6 +1404,8 @@ typedef struct IndexScanState
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
* VMBuffer buffer in use for visibility map testing, if any
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ioss_PscanLen Size of parallel index-only scan descriptor
* ----------------
*/
@@ -1422,6 +1424,8 @@ typedef struct IndexOnlyScanState
Relation ioss_RelationDesc;
struct IndexScanDescData *ioss_ScanDesc;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 4b7703d478..c6f7ab6956 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -829,6 +829,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1165,6 +1166,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1177,6 +1181,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 70f8b8e22b..b5b7d62b70 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -432,6 +432,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int skipPrefixSize; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9b6bdbc518..ad28c7f54a 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e70d6a3f18..fa461201a7 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -202,6 +202,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index c8bc6be061..69902d9cf3 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..6ae000920c 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,78 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------
+ Index Only Scan using tenk1_four on public.tenk1
+ Output: four
+ Scan mode: Skip scan
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+FETCH FROM c;
+ four
+------
+ 0
+(1 row)
+
+FETCH BACKWARD FROM c;
+ four
+------
+(0 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+END;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index cd46f071bd..04760639a8 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..3eeb4079db 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,24 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+END;
--
2.16.4
After some talks with Jesper at PGCon about the Index Skip Scan, I started testing this patch, because it seems to have great potential for speeding up many of our queries (great conference by the way, really enjoyed my first time being there!). I haven't looked in depth at the code itself, but I focused on some testing with real data that we have.
Let me start by sketching our primary use case for this, as it is similar to, but slightly different from, what was discussed earlier in this thread. I think this use case is something a lot of people who handle timeseries data have. Our database has many tables with timeseries data. We don't update rows, but just insert new rows each time. One example of this would be a table with prices for instruments. Instruments are identified by a column called feedcode. Prices of instruments update with a certain frequency. Each time a price updates, we insert a new row with the new value and the timestamp at that time. So in the simplest form, you could see it as a table like this:
create table prices (feedcode text, updated_at timestamptz, value float8); -- there'll be some other columns as well, this is just an example
create unique index on prices (feedcode, updated_at desc);
This table perfectly fits the criteria for the index skip scan as there are relatively few distinct feedcodes, but each of them has quite a lot of price inserts (imagine 1000 distinct feedcodes, each of them having one price per second). We normally partition our data by a certain time interval, so let's say we're just looking at one day of prices here. We have other cases with higher update frequencies and/or more distinct values though.
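For anyone who wants to reproduce this shape of data, something like the following generator roughly matches the description (a hypothetical sketch: the feedcode format is made up, and it is scaled down to one price per minute instead of per second to keep the test quick):

```sql
-- Hypothetical data generator for the prices table above:
-- 1000 distinct feedcodes, one price per minute over a single day.
insert into prices (feedcode, updated_at, value)
select 'FC' || fc,
       timestamptz '2019-06-01 00:00' + make_interval(mins => m),
       random()
from generate_series(1, 1000) fc,
     generate_series(0, 1439) m;
```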
Typical queries on this table involve querying the price at a certain point in time, or simply querying the latest update. If we know the feedcode, this is easy:
select * from prices where feedcode='A' and updated_at <= '2019-06-01 12:00' order by feedcode, updated_at desc limit 1
Unfortunately, this gets hard if you want to know the price of everything at a certain point in time. The query then becomes:
select distinct on (feedcode) * from prices where updated_at <= '2019-06-01 12:00' order by feedcode, updated_at desc
Up until now (even with this patch) this uses a regular index scan + a unique node which scans the full index, which is terribly slow and is also not constant - as the table grows it becomes slower and slower.
Obviously there are currently already ways to speed this up using the recursive loose index scan, but I think everybody agrees that those are pretty unreadable. However, since they're also several orders of magnitude faster, we actually use them everywhere. Eg.
-- certain point in time
-- first query *all* distinct feedcodes (disregarding time), then do an index scan for every feedcode found to see if it has an update in the time window we're interested in
-- this essentially means we're doing 2*N index scans
with recursive t as (
(select feedcode from prices order by feedcode, updated_at desc limit 1)
union all
select n.feedcode from t
cross join lateral (select feedcode from prices where feedcode > t.feedcode order by feedcode, updated_at desc limit 1) n
) select n.* from t
cross join lateral (select * from prices where feedcode=t.feedcode and updated_at <= '2019-06-01 12:00' order by feedcode, updated_at desc limit 1) n
-- just latest value
-- if we're interested in just the latest value, it can actually be optimized to just N index scans like this.
-- To make it even more confusing, there's a tradeoff here: if you're querying a timestamp close to the latest available timestamp, it is often faster to use this method anyway and just put the filter for updated_at inside this query. This avoids the overhead of 2*N index scans, at the expense that the LIMIT 1 may have to scan several tuples before finding one that matches the timestamp criteria. With the 2*N method above we're guaranteed that the first tuple it sees is the correct tuple, but we're doing many more scans...
with recursive t as (
(select * from prices order by feedcode, updated_at desc limit 1)
union all
select n.* from t
cross join lateral (select * from prices where feedcode > t.feedcode order by feedcode, updated_at desc limit 1) n
) select * from t
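For completeness, the filter-inside variant mentioned above would look roughly like this (a sketch under the same assumed schema; each LIMIT 1 may walk past newer tuples before one matches the timestamp criterion, but the total is N index scans instead of 2*N):

```sql
-- Variant with the updated_at filter pushed into the recursive CTE itself.
-- Each lateral probe returns, for the next feedcode that has any matching
-- row, its latest row at or before the cutoff; feedcodes with no matching
-- row are skipped entirely.
with recursive t as (
  (select * from prices
   where updated_at <= '2019-06-01 12:00'
   order by feedcode, updated_at desc limit 1)
  union all
  select n.* from t
  cross join lateral (
    select * from prices
    where feedcode > t.feedcode and updated_at <= '2019-06-01 12:00'
    order by feedcode, updated_at desc limit 1
  ) n
) select * from t;
```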
I hope this makes our current situation clear. Please do ask if I need to elaborate on something here.
So what changes with this patch? The great thing is that the recursive CTE is not required anymore! This is a big win for readability and it helps performance as well. It makes everything much better. I am really happy with these results. If the index skip scan triggers, it is easily over 100x faster than the naive 'distinct on' query in earlier versions of Postgres. It is also quite a bit faster than the recursive CTE version of the query.
I have a few remarks though. I tested some of our queries with the patch and found that the following query would (with patch) work best now for arbitrary timestamps:
-- first query all distinct values using the index skip scan, then do an index scan for each of these
select r.* from (
select distinct feedcode from prices
) k
cross join lateral (
select *
from prices
where feedcode=k.feedcode and updated_at <= '2019-06-01 12:00'
order by feedcode, updated_at desc
limit 1
) r
While certainly a big improvement over the recursive CTE, it would be nice if the even simpler form with the 'distinct on' worked out of the box using an index skip scan.
select distinct on (feedcode) * from prices where updated_at <= '2019-06-01 12:00' order by feedcode, updated_at desc
As far as I can see, there are two main problems with that at the moment.
1) Only support for Index-Only scans at the moment, not for regular index scans. This was already mentioned upthread, and I can understand that it was left out until now to constrain the scope. However, if we were to support 'distinct on' + selecting columns that are not part of the index, we need a regular index scan instead of the index-only scan.
2) The complicating part is that we're interested in the value 'at a specific point in time'. The comparison with updated_at ruins the efficiency, as the index scan first looks at the *latest* updated_at for a certain feedcode and then walks the tree until it finds a tuple that matches the updated_at criterion (which may never happen, in which case it will happily walk over the full index). I'm actually unsure if there is anything we can do about this (without adding a lot of complexity), aside from rewriting the query itself in the way I did, where we do the skip scan for all distinct items followed by a second round of index scans for the specific point in time. I'm struggling a bit to explain this part clearly - I hope it's clear though. Please let me know if I should elaborate. Perhaps it's easiest to see from the difference in speed between the following two queries:
select distinct feedcode from prices -- approx 10ms
select distinct feedcode from prices where updated_at <= '1999-01-01 00:00' -- approx 200ms
Both use the index skip scan, but the first one is very fast, because it can skip large parts of the index. The second one scans the full index, because it never finds any row that matches the where condition so it can never skip anything.
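With the patch applied, whether the skip machinery is chosen can be checked in EXPLAIN output (a sketch; the index name here is the default Postgres would generate for the earlier CREATE INDEX, and the plan shape is taken from the examples in the patch's regression tests):

```sql
-- "Scan mode: Skip scan" in the plan indicates the skip machinery is in
-- use. Note that, per the timings above, the plan line alone doesn't
-- guarantee that skipping is effective at run time: with a restrictive
-- updated_at filter the scan can still end up walking the whole index.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT DISTINCT feedcode FROM prices;
--  Index Only Scan using prices_feedcode_updated_at_idx on prices
--    Output: feedcode
--    Scan mode: Skip scan
```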
Thanks for this great patch. It is already very useful and fills a gap that has existed for a long time. It is going to make our queries so much more readable and performant if we won't have to resort to recursive CTEs anymore!
-Floris
Actually I'd like to add something to this. I think I've found a bug in the current implementation. Would someone be able to check?
Given a table definition of (market text, feedcode text, updated_at timestamptz, value float8) and an index on (market, feedcode, updated_at desc) (note that this table slightly deviates from what I described in my previous mail), filled with data:
The following query uses an index skip scan and returns just 1 row (incorrect!)
select distinct on (market, feedcode) market, feedcode
from streams.base_price
where market='TEST'
The following query still uses the regular index scan and returns many more rows (correct)
select distinct on (market, feedcode) *
from streams.base_price
where market='TEST'
It seems that partially filtering on one of the distinct columns triggers incorrect behavior where too many rows in the index are skipped.
-Floris
On Sat, 1 Jun 2019 at 06:10, Floris Van Nee <florisvannee@optiver.com> wrote:
Actually I'd like to add something to this. I think I've found a bug in the current implementation. Would someone be able to check?
I am willing to give it a try.
Given a table definition of (market text, feedcode text, updated_at timestamptz, value float8) and an index on (market, feedcode, updated_at desc) (note that this table slightly deviates from what I described in my previous mail) and filling it with data.
The following query uses an index skip scan and returns just 1 row (incorrect!)
select distinct on (market, feedcode) market, feedcode
from streams.base_price
where market='TEST'

The following query still uses the regular index scan and returns many more rows (correct)
select distinct on (market, feedcode) *
from streams.base_price
where market='TEST'
Aren't those two queries different?
select distinct on (market, feedcode) market, feedcode vs. select
distinct on (market, feedcode) *
Anyhow, it's just a difference in projection, so it doesn't matter much.
I verified this scenario at my end and you are right, there is a bug.
Here is my repeatable test case,
create table t (market text, feedcode text, updated_at timestamptz,
value float8) ;
create index on t (market, feedcode, updated_at desc);
insert into t values('TEST', 'abcdef', (select timestamp '2019-01-10
20:00:00' + random() * (timestamp '2014-01-20 20:00:00' - timestamp
'2019-01-20 20:00:00') ), generate_series(1,100)*9.88);
insert into t values('TEST', 'jsgfhdfjd', (select timestamp
'2019-01-10 20:00:00' + random() * (timestamp '2014-01-20 20:00:00' -
timestamp '2019-01-20 20:00:00') ), generate_series(1,100)*9.88);
Now, without the patch,
select distinct on (market, feedcode) market, feedcode from t where
market='TEST';
market | feedcode
--------+-----------
TEST | abcdef
TEST | jsgfhdfjd
(2 rows)
explain select distinct on (market, feedcode) market, feedcode from t
where market='TEST';
QUERY PLAN
----------------------------------------------------------------
Unique (cost=12.20..13.21 rows=2 width=13)
-> Sort (cost=12.20..12.70 rows=201 width=13)
Sort Key: feedcode
-> Seq Scan on t (cost=0.00..4.51 rows=201 width=13)
Filter: (market = 'TEST'::text)
(5 rows)
And with the patch,
select distinct on (market, feedcode) market, feedcode from t where
market='TEST';
market | feedcode
--------+----------
TEST | abcdef
(1 row)
explain select distinct on (market, feedcode) market, feedcode from t
where market='TEST';
QUERY PLAN
------------------------------------------------------------------------------------------------
Index Only Scan using t_market_feedcode_updated_at_idx on t
(cost=0.14..0.29 rows=2 width=13)
Scan mode: Skip scan
Index Cond: (market = 'TEST'::text)
(3 rows)
Notice that in the explain statement it shows the correct number of rows
to be skipped.
--
Regards,
Rafia Sabih
On Sat, Jun 1, 2019 at 6:10 AM Floris Van Nee <florisvannee@optiver.com> wrote:
After some talks with Jesper at PGCon about the Index Skip Scan, I started
testing this patch, because it seems to have great potential in speeding up
many of our queries (great conference by the way, really enjoyed my first
time being there!). I haven't looked in depth to the code itself, but I
focused on some testing with real data that we have.
Thanks!
Actually I'd like to add something to this. I think I've found a bug in the
current implementation. Would someone be able to check?

The following query uses an index skip scan and returns just 1 row (incorrect!)
select distinct on (market, feedcode) market, feedcode
from streams.base_price
where market='TEST'

The following query still uses the regular index scan and returns many more
rows (correct)
select distinct on (market, feedcode) *
from streams.base_price
where market='TEST'
Yes, good catch, I'll investigate. Looks like in the current implementation
something is not quite right, when we have this order of columns in an index
and where clause (e.g. in the examples above everything seems fine if we create
index over (feedcode, market) and not the other way around).
As far as I can see, there are two main problems with that at the moment.
1) Only support for Index-Only scan at the moment, not for regular index
scans. This was already mentioned upthread and I can understand that it
was left out until now to constrain the scope of this. However, if we were
to support 'distinct on' + selecting columns that are not part of the
index we need a regular index scan instead of the index only scan.
Sure, it's something I hope we can tackle as the next step.
select distinct feedcode from prices -- approx 10ms
select distinct feedcode from prices where updated_at <= '1999-01-01 00:00' -- approx 200ms
Both use the index skip scan, but the first one is very fast, because it can
skip large parts of the index. The second one scans the full index, because
it never finds any row that matches the where condition so it can never skip
anything.
Interesting, I'll take a closer look.
Hi,
Thanks for the helpful replies.
Yes, good catch, I'll investigate. Looks like in the current implementation
something is not quite right, when we have this order of columns in an index
and where clause (e.g. in the examples above everything seems fine if we create
index over (feedcode, market) and not the other way around).
I did a little bit of investigation and it seems to occur because in pathkeys.c the function pathkey_is_redundant considers pathkeys redundant if there is an equality condition with a constant in the corresponding WHERE clause.
* 1. If the new pathkey's equivalence class contains a constant, and isn't
* below an outer join, then we can disregard it as a sort key. An example:
* SELECT ... WHERE x = 42 ORDER BY x, y;
In planner.c it builds the list of distinct_pathkeys, which is then used by the index skip scan to skip over the first length(distinct_pathkeys) columns when it does a skip. In my query, the distinct_pathkeys list only contains 'feedcode' and not 'market', because 'market' was considered redundant due to the WHERE clause. However, the index skip scan interprets this as meaning it has to skip over just the first column.
We need to compute the number of prefix columns to skip differently while building the plan: we need the 'real' number of distinct keys, without throwing away the redundant ones. However, I'm not sure if this information can still be obtained by the time create_skipscan_unique_path is called. But I'm sure people here will have much better ideas than me about this :-)
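The constant-elimination behavior described above is easy to observe directly (a standalone sketch; the table and column names are made up, echoing the example from the pathkeys.c comment):

```sql
-- The planner drops "x" from the sort pathkeys because the WHERE clause
-- pins it to a constant, so only "y" survives as a sort key. The same
-- elimination is what leaves 'market' out of distinct_pathkeys above.
CREATE TABLE pk_demo (x integer, y integer);
EXPLAIN (COSTS OFF) SELECT * FROM pk_demo WHERE x = 42 ORDER BY x, y;
--  Sort
--    Sort Key: y
--    ->  Seq Scan on pk_demo
--          Filter: (x = 42)
```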
-Floris
On Sat, Jun 1, 2019 at 5:34 PM Floris Van Nee <florisvannee@optiver.com> wrote:
I did a little bit of investigation and it seems to occur because in
pathkeys.c the function pathkey_is_redundant considers pathkeys redundant if
there is an equality condition with a constant in the corresponding WHERE
clause.
...
However, the index skip scan interprets this as meaning it has to skip over just
the first column.
Right, passing the correct number of columns fixes this particular problem. But
while debugging I've also discovered another related issue: the current
implementation seems to have a few assumptions that are not correct if we have
an index condition and a distinct column is not the first in the index. I'll
try to address these in the next version of the patch in the near future.
Hi Floris,
On 6/1/19 12:10 AM, Floris Van Nee wrote:
Given a table definition of (market text, feedcode text, updated_at timestamptz, value float8) and an index on (market, feedcode, updated_at desc) (note that this table slightly deviates from what I described in my previous mail) and filling it with data.
The following query uses an index skip scan and returns just 1 row (incorrect!)
select distinct on (market, feedcode) market, feedcode
from streams.base_price
where market='TEST'

The following query still uses the regular index scan and returns many more rows (correct)
select distinct on (market, feedcode) *
from streams.base_price
where market='TEST'

It seems that partially filtering on one of the distinct columns triggers incorrect behavior where too many rows in the index are skipped.
Thanks for taking a look at the patch, and your feedback on it.
I'll def look into this once I'm back from my travels.
Best regards,
Jesper
Hi Rafia,
On 6/1/19 6:03 AM, Rafia Sabih wrote:
Here is my repeatable test case,
create table t (market text, feedcode text, updated_at timestamptz,
value float8) ;
create index on t (market, feedcode, updated_at desc);
insert into t values('TEST', 'abcdef', (select timestamp '2019-01-10
20:00:00' + random() * (timestamp '2014-01-20 20:00:00' - timestamp
'2019-01-20 20:00:00') ), generate_series(1,100)*9.88);
insert into t values('TEST', 'jsgfhdfjd', (select timestamp
'2019-01-10 20:00:00' + random() * (timestamp '2014-01-20 20:00:00' -
timestamp '2019-01-20 20:00:00') ), generate_series(1,100)*9.88);Now, without the patch,
select distinct on (market, feedcode) market, feedcode from t where
market='TEST';
market | feedcode
--------+-----------
TEST | abcdef
TEST | jsgfhdfjd
(2 rows)
explain select distinct on (market, feedcode) market, feedcode from t
where market='TEST';
QUERY PLAN
----------------------------------------------------------------
Unique (cost=12.20..13.21 rows=2 width=13)
-> Sort (cost=12.20..12.70 rows=201 width=13)
Sort Key: feedcode
-> Seq Scan on t (cost=0.00..4.51 rows=201 width=13)
Filter: (market = 'TEST'::text)
(5 rows)

And with the patch,
select distinct on (market, feedcode) market, feedcode from t where
market='TEST';
market | feedcode
--------+----------
TEST | abcdef
(1 row)

explain select distinct on (market, feedcode) market, feedcode from t
where market='TEST';
QUERY PLAN
------------------------------------------------------------------------------------------------
Index Only Scan using t_market_feedcode_updated_at_idx on t
(cost=0.14..0.29 rows=2 width=13)
Scan mode: Skip scan
Index Cond: (market = 'TEST'::text)
(3 rows)

Notice that in the explain statement it shows the correct number of rows
to be skipped.
Thanks for your test case; this is very helpful.
For now, I would like to highlight that
SET enable_indexskipscan = OFF
can be used for testing with the patch applied.
Dmitry and I will look at the feedback provided.
Best regards,
Jesper
On Sat, Jun 1, 2019 at 6:57 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
On Sat, Jun 1, 2019 at 5:34 PM Floris Van Nee <florisvannee@optiver.com> wrote:
I did a little bit of investigation and it seems to occur because in
pathkeys.c the function pathkey_is_redundant considers pathkeys redundant if
there is an equality condition with a constant in the corresponding WHERE
clause.
...
However, the index skip scan interprets this as meaning it has to skip over just
the first column.

Right, passing the correct number of columns fixes this particular problem. But
while debugging I've also discovered another related issue: the current
implementation seems to have a few assumptions that are not correct if we have
an index condition and a distinct column is not the first in the index. I'll
try to address these in the next version of the patch in the near future.
So, as mentioned above, there were a few problems, namely the number of
distinct_pathkeys with and without redundancy, and using _bt_search when the
order of distinct columns doesn't match the index. As far as I can see the
problem in the latter case (when we have an index condition) is that it's still
possible to find a value, but the lastItem value after the search is always zero
(due to _bt_checkkeys filtering) and _bt_next stops right away.
To address this, probably we can do something like in the attached patch.
Alongside distinct_pathkeys, uniq_distinct_pathkeys is stored, which is
the same but without the constant elimination. It is then used for
getting the real number of distinct keys, and to check the order of the columns
so that index skip scan is not considered if it's different. Hope it doesn't
look too hacky.
Also I've noticed that the current implementation wouldn't work e.g. for:
select distinct a, a from table;
because in this case an IndexPath is hidden behind a ProjectionPath. For now I
guess it's fine, but it's probably possible to apply skip scan here too.
Attachments:
v17-0001-Index-skip-scan.patch (application/octet-stream)
From 16664a868cd2b55a548ce7263d92934a19dfa9c0 Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Sat, 15 Sep 2018 21:14:50 +0200
Subject: [PATCH v17] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan. To make it suitable both for situations with a
small number of distinct values and for those with a significant number of
distinct values, the following approach is taken: instead of searching
from the root for every value, we first search on the current page, and
then, if the value is not found there, continue searching from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Jesper Pedersen, and a bit adjusted by Dmitry Dolgov.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 16 ++
doc/src/sgml/indexam.sgml | 10 ++
doc/src/sgml/indices.sgml | 24 +++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 ++
src/backend/access/nbtree/nbtree.c | 12 ++
src/backend/access/nbtree/nbtsearch.c | 224 +++++++++++++++++++++++++-
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 12 ++
src/backend/executor/nodeIndexonlyscan.c | 22 +++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 2 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/path/pathkeys.c | 83 ++++++++--
src/backend/optimizer/plan/createplan.c | 10 +-
src/backend/optimizer/plan/planagg.c | 1 +
src/backend/optimizer/plan/planner.c | 67 +++++++-
src/backend/optimizer/util/pathnode.c | 40 +++++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 5 +
src/include/access/genam.h | 1 +
src/include/access/nbtree.h | 5 +
src/include/nodes/execnodes.h | 4 +
src/include/nodes/pathnodes.h | 8 +
src/include/nodes/plannodes.h | 1 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/include/optimizer/paths.h | 4 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 176 ++++++++++++++++++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 55 +++++++
40 files changed, 811 insertions(+), 19 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index ee3bd56274..a88b730f2e 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..a1c8a1ea27 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4400,6 +4400,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index dd54c68802..c2eb296306 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -144,6 +144,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -687,6 +688,15 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan.
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 95c0a1926c..592149f10e 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1235,6 +1235,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When an index scan is used to retrieve the distinct values of a column,
+ it can be quite inefficient, since it has to scan all the equal values
+ of a key. In such cases the planner will consider applying the index
+ skip scan approach, which is based on the idea of
+ <ulink url="https://wiki.postgresql.org/wiki/Loose_indexscan">
+ Loose index scan</ulink>. Rather than scanning all equal values of a
+ key, as soon as a new value is found, it will search for a larger value
+ on the same index page, and if none is found there, restart the search
+ from the root of the index. This is much faster when there are many equal keys.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ae7b729edd..233ea9e5ec 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -109,6 +109,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cf9699ad18..9817f34c34 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 45c00aaa87..b9b707e9be 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e9f2c84af1..b7e9a1e949 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -82,6 +82,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index aefdd2916d..1c2def162c 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -730,6 +731,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 85e54ac44b..3e50abd6b0 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -131,6 +131,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -380,6 +381,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -447,6 +450,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 1f809c24a1..44d2a3caf8 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -37,7 +37,10 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
+static inline void _bt_update_skip_scankeys(IndexScanDesc scan,
+ Relation indexRel);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir);
/*
* _bt_drop_lock_and_maybe_pin()
@@ -1380,6 +1383,184 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Relation indexRel = scan->indexRelation;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ so->skipScanKey->keysz = prefix;
+ so->skipScanKey->scantid = NULL;
+ }
+ else
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos) &&
+ _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {
+ bool keyFound = false;
+ buf = so->currPos.buf;
+ LockBuffer(buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found the scan key within the current page, so let's scan
+ * from the root. Use _bt_search and _bt_binsrch to get the buffer and
+ * offset number.
+ */
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ if (ScanDirectionIsForward(dir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ else
+ {
+ /* For a backward scan, finding offnum is more involved. It is wrong
+ * to just use binary search, since that finds the last item from the
+ * sequence of equal items while we need the first one; otherwise e.g.
+ * a backward cursor scan would return an incorrect value. */
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ /* One step back to find a previous value */
+ if (_bt_next(scan, dir))
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ /* And now find the last item from the sequence for the current
+ * value, with the intention to do OffsetNumberNext. As a result we
+ * end up on the first element of the sequence. */
+ if (_bt_scankey_within_page(scan, so->skipScanKey,
+ so->currPos.buf, dir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ else
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ }
+ else
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ else
+ offnum = OffsetNumberNext(offnum);
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
@@ -2249,3 +2430,44 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
so->numKilled = 0; /* just paranoia */
so->markItemIndex = -1; /* ditto */
}
+
+static inline void
+_bt_update_skip_scankeys(IndexScanDesc scan, Relation indexRel)
+{
+ TupleDesc itupdesc;
+ int indnkeyatts, i;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey scankeys = so->skipScanKey->scankeys;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+}
+
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir)
+{
+ OffsetNumber low, high, compare_offset;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ return _bt_compare(scan->indexRelation,
+ key, page, compare_offset) > compare_value;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 45472db147..dc151ecf09 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 92969636b7..c750437cc6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1373,6 +1373,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->skipPrefixSize,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1595,6 +1603,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->skipPrefixSize > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index ee5b1c493b..e833b40841 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -115,6 +115,24 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_SkipPrefixSize > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_SkipPrefixSize))
+ {
+ /* Reached the end of the index. At this point currPos is
+ * invalidated, and we need to reset ioss_FirstTupleEmitted,
+ * since otherwise, after going backwards, reaching the end of
+ * the index, and going forward again, we would apply the skip
+ * again, incorrectly skipping an extra item. */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -253,6 +271,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -503,6 +523,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->skipPrefixSize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 78deade89b..421719cbb7 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -515,6 +515,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(skipPrefixSize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 237598e110..14258372b5 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -573,6 +573,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(skipPrefixSize);
}
static void
@@ -2208,6 +2209,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
WRITE_NODE_FIELD(group_pathkeys);
WRITE_NODE_FIELD(window_pathkeys);
WRITE_NODE_FIELD(distinct_pathkeys);
+ WRITE_NODE_FIELD(uniq_distinct_pathkeys);
WRITE_NODE_FIELD(sort_pathkeys);
WRITE_NODE_FIELD(processed_tlist);
WRITE_NODE_FIELD(minmax_aggs);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 6c2626ee62..5e431bec7b 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1806,6 +1806,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(skipPrefixSize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a2a9b1f7be..6e0fe90e5c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 08b5061612..5e96058ab9 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -94,6 +94,30 @@ make_canonical_pathkey(PlannerInfo *root,
return pk;
}
+/*
+ * pathkey_is_unique
+ * The part of pathkey_is_redundant that is responsible for the case where
+ * the new pathkey's equivalence class is the same as that of an existing
+ * member of the pathkey list.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+ EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+ ListCell *lc;
+
+ /* If same EC already used in list, then redundant */
+ foreach(lc, pathkeys)
+ {
+ PathKey *old_pathkey = (PathKey *) lfirst(lc);
+
+ if (new_ec == old_pathkey->pk_eclass)
+ return true;
+ }
+
+ return false;
+}
+
/*
* pathkey_is_redundant
* Is a pathkey redundant with one already in the given list?
@@ -133,22 +157,12 @@ static bool
pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys)
{
EquivalenceClass *new_ec = new_pathkey->pk_eclass;
- ListCell *lc;
/* Check for EC containing a constant --- unconditionally redundant */
if (EC_MUST_BE_REDUNDANT(new_ec))
return true;
- /* If same EC already used in list, then redundant */
- foreach(lc, pathkeys)
- {
- PathKey *old_pathkey = (PathKey *) lfirst(lc);
-
- if (new_ec == old_pathkey->pk_eclass)
- return true;
- }
-
- return false;
+ return pathkey_is_unique(new_pathkey, pathkeys);
}
/*
@@ -1096,6 +1110,53 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
return pathkeys;
}
+/*
+ * make_pathkeys_for_distinctclauses
+ * Generate a pathkeys list for DISTINCT clauses, representing the sort
+ * order specified by a list of SortGroupClauses. Similar to
+ * make_pathkeys_for_sortclauses, but allows specifying whether we need to
+ * check for full redundancy, or just uniqueness.
+ */
+List *
+make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *distinctclauses,
+ List *tlist, bool checkRedundant)
+{
+ List *pathkeys = NIL;
+ ListCell *l;
+
+ foreach(l, distinctclauses)
+ {
+ SortGroupClause *sortcl = (SortGroupClause *) lfirst(l);
+ Expr *sortkey;
+ PathKey *pathkey;
+
+ sortkey = (Expr *) get_sortgroupclause_expr(sortcl, tlist);
+ Assert(OidIsValid(sortcl->sortop));
+ pathkey = make_pathkey_from_sortop(root,
+ sortkey,
+ root->nullable_baserels,
+ sortcl->sortop,
+ sortcl->nulls_first,
+ sortcl->tleSortGroupRef,
+ true);
+
+ /* Canonical form eliminates redundant ordering keys */
+ if (checkRedundant)
+ {
+ if (!pathkey_is_redundant(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ else
+ {
+ if (!pathkey_is_unique(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ }
+ return pathkeys;
+}
+
+
/****************************************************************************
* PATHKEYS AND MERGECLAUSES
****************************************************************************/
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 608d5adfed..b578dcca20 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -180,7 +180,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2903,7 +2904,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -5179,7 +5181,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5194,6 +5197,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->skipPrefixSize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 9381939c82..ed52139839 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -505,6 +505,7 @@ minmax_qp_callback(PlannerInfo *root, void *extra)
root->group_pathkeys = NIL;
root->window_pathkeys = NIL;
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index cb897cc7f4..ed16b62dba 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3615,12 +3615,21 @@ standard_qp_callback(PlannerInfo *root, void *extra)
if (parse->distinctClause &&
grouping_is_sortable(parse->distinctClause))
+ {
+ root->uniq_distinct_pathkeys =
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, false);
root->distinct_pathkeys =
- make_pathkeys_for_sortclauses(root,
- parse->distinctClause,
- tlist);
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, true);
+ }
else
+ {
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
+ }
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
@@ -4802,11 +4811,63 @@ create_distinct_paths(PlannerInfo *root,
if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))
{
+ ListCell *lc;
+ IndexOptInfo *index = NULL;
+ bool differentColumnsOrder = false;
+ int i = 0;
+
add_path(distinct_rel, (Path *)
create_upper_unique_path(root, distinct_rel,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath))
+ index = ((IndexPath *) path)->indexinfo;
+ else
+ continue;
+
+ /*
+ * The order of columns in the index must be the same as for the
+ * unique distinct pathkeys, otherwise we cannot use _bt_search
+ * in the skip implementation, as this could lead to missing
+ * records.
+ */
+ foreach(lc, root->uniq_distinct_pathkeys)
+ {
+ PathKey *pathKey = lfirst_node(PathKey, lc);
+ EquivalenceMember *em =
+ lfirst_node(EquivalenceMember,
+ list_head(pathKey->pk_eclass->ec_members));
+ Var *var = (Var *) em->em_expr;
+
+ Assert(i < index->ncolumns);
+
+ if (index->indexkeys[i] != var->varattno)
+ {
+ differentColumnsOrder = true;
+ break;
+ }
+
+ i++;
+ }
+
+ if (path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ index->amcanskip &&
+ root->distinct_pathkeys != NIL &&
+ !differentColumnsOrder)
+ {
+ int distinctPrefixKeys =
+ list_length(root->uniq_distinct_pathkeys);
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ distinctPrefixKeys,
+ numDistinctRows));
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d884d2bb00..df9b57215f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2928,6 +2928,46 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *basepath,
+ int distinctPrefixKeys,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(distinctPrefixKeys > 0);
+ /*Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));*/
+ pathnode->indexskipprefix = distinctPrefixKeys;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = basepath->startup_cost;
+ pathnode->path.total_cost = basepath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 2405acbf6f..0f0bdad2ac 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -269,6 +269,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1208eb9a68..007c8ac14e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -912,6 +912,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5ee5e09ddf..99facc8f50 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6e3db06eed..34033c5486 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -130,6 +130,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -225,6 +229,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 8c053be2ca..2e79098b85 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -173,6 +173,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a3583f225b..247cdb8127 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -663,6 +663,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTScanInsert skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -777,6 +780,7 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -801,6 +805,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 64122bc1e3..5c4eb48bd6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1404,6 +1404,8 @@ typedef struct IndexScanState
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
* VMBuffer buffer in use for visibility map testing, if any
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted whether the first tuple has been emitted
* ioss_PscanLen Size of parallel index-only scan descriptor
* ----------------
*/
@@ -1422,6 +1424,8 @@ typedef struct IndexOnlyScanState
Relation ioss_RelationDesc;
struct IndexScanDescData *ioss_ScanDesc;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 4b7703d478..e571c84473 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -298,6 +298,9 @@ struct PlannerInfo
List *group_pathkeys; /* groupClause pathkeys, if any */
List *window_pathkeys; /* pathkeys of bottom window, if any */
List *distinct_pathkeys; /* distinctClause pathkeys, if any */
+ List *uniq_distinct_pathkeys; /* unique, but not necessarily
+ non-redundant distinctClause
+ pathkeys, if any */
List *sort_pathkeys; /* sortClause pathkeys, if any */
List *part_schemes; /* Canonicalised partition schemes used in the
@@ -829,6 +832,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1165,6 +1169,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1177,6 +1184,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 70f8b8e22b..b5b7d62b70 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -432,6 +432,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int skipPrefixSize; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9b6bdbc518..ad28c7f54a 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e70d6a3f18..fa461201a7 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -202,6 +202,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137d1d..a782d12a50 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -209,6 +209,10 @@ extern List *build_join_pathkeys(PlannerInfo *root,
extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
List *sortclauses,
List *tlist);
+extern List *make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist,
+ bool checkRedundant);
extern void initialize_mergeclause_eclasses(PlannerInfo *root,
RestrictInfo *restrictinfo);
extern void update_mergeclause_eclasses(PlannerInfo *root,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index c8bc6be061..69902d9cf3 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..17a017dc4b 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,179 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+-- check columns order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ four | ten
+------+-----
+ 0 | 0
+ 0 | 2
+ 0 | 4
+ 0 | 6
+ 0 | 8
+(5 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+------------------------------------------------------
+ Index Only Scan using tenk1_four_ten on public.tenk1
+ Output: four
+ Scan mode: Skip scan
+ Index Cond: (tenk1.ten = 2)
+(4 rows)
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ QUERY PLAN
+------------------------------------------------------
+ Index Only Scan using tenk1_four_ten on public.tenk1
+ Output: four, ten
+ Scan mode: Skip scan
+ Index Cond: (tenk1.four = 0)
+(4 rows)
+
+DROP INDEX tenk1_four_ten;
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ four | ten
+------+-----
+ 0 | 2
+ 2 | 2
+(2 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+------------------------------------------------------------
+ Unique
+ Output: four
+ -> Index Only Scan using tenk1_ten_four on public.tenk1
+ Output: four
+ Index Cond: (tenk1.ten = 2)
+(5 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+------------------------------------------------------------
+ Unique
+ Output: four, ten
+ -> Index Only Scan using tenk1_ten_four on public.tenk1
+ Output: four, ten
+ Index Cond: (tenk1.ten = 2)
+(5 rows)
+
+DROP INDEX tenk1_ten_four;
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+ four | four
+------+------
+ 0 | 0
+ 2 | 2
+(2 rows)
+
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+ four | ?column?
+------+----------
+ 2 | 1
+ 0 | 1
+(2 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------
+ Index Only Scan using tenk1_four on public.tenk1
+ Output: four
+ Scan mode: Skip scan
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+FETCH FROM c;
+ four
+------
+ 0
+(1 row)
+
+FETCH BACKWARD FROM c;
+ four
+------
+(0 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+END;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index cd46f071bd..04760639a8 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..ab7e7bd53c 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,58 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+
+-- check columns order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+DROP INDEX tenk1_four_ten;
+
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+DROP INDEX tenk1_ten_four;
+
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+END;
--
2.16.4
To address this, we can probably do something like in the attached patch.
Alongside distinct_pathkeys, uniq_distinct_pathkeys is stored, which is the
same list but without the constant elimination. It is then used to get the
real number of distinct keys, and to check the column order so that an index
skip scan is not considered when it differs. Hope it doesn't look too hacky.
Thanks! I've verified that it works now.
I was wondering if we're not too strict in some cases now though. Consider the following queries:
postgres=# explain(analyze) select distinct on (m,f) m,f from t where m='M2';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Index Only Scan using t_m_f_t_idx on t (cost=0.29..11.60 rows=40 width=5) (actual time=0.056..0.469 rows=10 loops=1)
Scan mode: Skip scan
Index Cond: (m = 'M2'::text)
Heap Fetches: 10
Planning Time: 0.095 ms
Execution Time: 0.490 ms
(6 rows)
postgres=# explain(analyze) select distinct on (f) m,f from t where m='M2';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.29..849.83 rows=10 width=5) (actual time=0.088..10.920 rows=10 loops=1)
-> Index Only Scan using t_m_f_t_idx on t (cost=0.29..824.70 rows=10052 width=5) (actual time=0.087..8.524 rows=10000 loops=1)
Index Cond: (m = 'M2'::text)
Heap Fetches: 10000
Planning Time: 0.078 ms
Execution Time: 10.944 ms
(6 rows)
This is basically the opposite case: when distinct_pathkeys matches the filtered list of index keys, an index skip scan could still be considered. Currently, the user needs to write 'distinct m,f' explicitly, even though the WHERE clause already restricts 'm' to a single value. Perhaps it's fine like this, but it could be a small improvement for consistency.
-Floris
Hi,
On 6/5/19 3:39 PM, Floris Van Nee wrote:
Thanks! I've verified that it works now.
Here is a rebased version.
I was wondering if we're not too strict in some cases now though. Consider the following queries:
[snip]
This is basically the opposite case - when distinct_pathkeys matches the filtered list of index keys, an index skip scan could be considered. Currently, the user needs to write 'distinct m,f' explicitly, even though he specifies in the WHERE-clause that 'm' can only have one value anyway. Perhaps it's fine like this, but it could be a small improvement for consistency.
I think it would be good to get more feedback on the patch in general
before looking at further optimizations. We should of course fix any
bugs that show up.
Thanks for your testing and feedback !
Best regards,
Jesper
Attachments:
v18-0001-Index-skip-scan.patch (text/x-patch)
From afebbc8a844b59d1037ac7fe66131cc3eabcb5ae Mon Sep 17 00:00:00 2001
From: jesperpedersen <jesper.pedersen@redhat.com>
Date: Thu, 13 Jun 2019 09:04:14 -0400
Subject: [PATCH] Implementation of Index Skip Scan (see Loose Index Scan in
the wiki [1]) on top of IndexOnlyScan. To make it suitable both when there
are few distinct values and when there are many, the following approach is
taken: instead of descending from the root for every value, we first search
for the next value on the current page, and only if it is not found there
do we continue the search from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 16 ++
doc/src/sgml/indexam.sgml | 10 +
doc/src/sgml/indices.sgml | 24 ++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 ++
src/backend/access/nbtree/nbtree.c | 12 +
src/backend/access/nbtree/nbtsearch.c | 224 +++++++++++++++++-
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 12 +
src/backend/executor/nodeIndexonlyscan.c | 22 ++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 2 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/path/pathkeys.c | 84 ++++++-
src/backend/optimizer/plan/createplan.c | 10 +-
src/backend/optimizer/plan/planagg.c | 1 +
src/backend/optimizer/plan/planner.c | 67 +++++-
src/backend/optimizer/util/pathnode.c | 40 ++++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 5 +
src/include/access/genam.h | 1 +
src/include/access/nbtree.h | 5 +
src/include/nodes/execnodes.h | 4 +
src/include/nodes/pathnodes.h | 8 +
src/include/nodes/plannodes.h | 1 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/include/optimizer/paths.h | 4 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 176 ++++++++++++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 55 +++++
40 files changed, 812 insertions(+), 19 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index ee3bd56274..a88b730f2e 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..a1c8a1ea27 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4400,6 +4400,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index dd54c68802..c2eb296306 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -144,6 +144,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -687,6 +688,15 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan.
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 95c0a1926c..592149f10e 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1235,6 +1235,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When an index scan is used to retrieve the distinct values of a column,
+ it can be inefficient, since it has to step over all the equal values of
+ a key. In such cases the planner will consider an index skip scan, which
+ is based on the idea of a
+ <ulink url="https://wiki.postgresql.org/wiki/Loose_indexscan">
+ Loose index scan</ulink>. Rather than scanning all equal values of a key,
+ as soon as a new value is found, it searches for a larger value on the
+ same index page, and if none is found there, restarts the search from the
+ root of the index. This is much faster when the index has many equal keys.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ae7b729edd..233ea9e5ec 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -109,6 +109,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cf9699ad18..9817f34c34 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 470b121e7d..328c17f13a 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e9f2c84af1..b7e9a1e949 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -82,6 +82,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index aefdd2916d..1c2def162c 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -730,6 +731,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 85e54ac44b..3e50abd6b0 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -131,6 +131,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -380,6 +381,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -447,6 +450,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 1f809c24a1..44d2a3caf8 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -37,7 +37,10 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
+static inline void _bt_update_skip_scankeys(IndexScanDesc scan,
+ Relation indexRel);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir);
/*
* _bt_drop_lock_and_maybe_pin()
@@ -1380,6 +1383,184 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Relation indexRel = scan->indexRelation;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ so->skipScanKey->keysz = prefix;
+ so->skipScanKey->scantid = NULL;
+ }
+ else
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos) &&
+ _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {
+ bool keyFound = false;
+
+ LockBuffer(so->currPos.buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, so->currPos.buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(so->currPos.buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found scan key within the current page, so let's scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+ * number
+ */
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ if (ScanDirectionIsForward(dir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ else
+ {
+ /* For a backward scan, finding offnum is more involved. It is wrong to
+ * just use binary search, since we will find the last item from the
+ * sequence of equal items, and we need the first one. Otherwise e.g.
+ * backward cursor scan will return an incorrect value. */
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ /* One step back to find a previous value */
+ if (_bt_next(scan, dir))
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ /* And now find the last item from the sequence for the current
+ * value, with the intention to do OffsetNumberNext. As a result we
+ * end up on the first element of the sequence. */
+ if (_bt_scankey_within_page(scan, so->skipScanKey,
+ so->currPos.buf, dir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ else
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ }
+ else
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ else
+ offnum = OffsetNumberNext(offnum);
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
@@ -2249,3 +2430,44 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
so->numKilled = 0; /* just paranoia */
so->markItemIndex = -1; /* ditto */
}
+
+static inline void
+_bt_update_skip_scankeys(IndexScanDesc scan, Relation indexRel)
+{
+ TupleDesc itupdesc;
+ int indnkeyatts, i;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey scankeys = so->skipScanKey->scankeys;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+}
+
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir)
+{
+ OffsetNumber low, high, compare_offset;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ return _bt_compare(scan->indexRelation,
+ key, page, compare_offset) > compare_value;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 45472db147..dc151ecf09 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 92969636b7..c750437cc6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1373,6 +1373,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->skipPrefixSize,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1595,6 +1603,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->skipPrefixSize > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 8a4d795d1a..15e2ff7b1b 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -115,6 +115,24 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_SkipPrefixSize > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_SkipPrefixSize))
+ {
+ /* Reached end of index. At this point currPos is invalidated,
+ * and we need to reset ioss_FirstTupleEmitted, since otherwise
+ * after going backwards, reaching the end of index, and going
+ * forward again we would apply the skip again. It would be incorrect and
+ * lead to an extra skipped item. */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -253,6 +271,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -503,6 +523,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->skipPrefixSize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 78deade89b..421719cbb7 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -515,6 +515,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(skipPrefixSize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 8400dd319e..f99ed30632 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -573,6 +573,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(skipPrefixSize);
}
static void
@@ -2208,6 +2209,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
WRITE_NODE_FIELD(group_pathkeys);
WRITE_NODE_FIELD(window_pathkeys);
WRITE_NODE_FIELD(distinct_pathkeys);
+ WRITE_NODE_FIELD(uniq_distinct_pathkeys);
WRITE_NODE_FIELD(sort_pathkeys);
WRITE_NODE_FIELD(processed_tlist);
WRITE_NODE_FIELD(minmax_aggs);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 6c2626ee62..5e431bec7b 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1806,6 +1806,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(skipPrefixSize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a2a9b1f7be..6e0fe90e5c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 08b5061612..af7d9c4270 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
#include "utils/lsyscache.h"
+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
RelOptInfo *partrel,
@@ -94,6 +95,30 @@ make_canonical_pathkey(PlannerInfo *root,
return pk;
}
+/*
+ * pathkey_is_unique
+ * Part of pathkey_is_redundant that is responsible for the case when the
+ * new pathkey's equivalence class is the same as that of any existing
+ * member of the pathkey list.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+ EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+ ListCell *lc;
+
+ /* If same EC already used in list, then redundant */
+ foreach(lc, pathkeys)
+ {
+ PathKey *old_pathkey = (PathKey *) lfirst(lc);
+
+ if (new_ec == old_pathkey->pk_eclass)
+ return true;
+ }
+
+ return false;
+}
+
/*
* pathkey_is_redundant
* Is a pathkey redundant with one already in the given list?
@@ -133,22 +158,12 @@ static bool
pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys)
{
EquivalenceClass *new_ec = new_pathkey->pk_eclass;
- ListCell *lc;
/* Check for EC containing a constant --- unconditionally redundant */
if (EC_MUST_BE_REDUNDANT(new_ec))
return true;
- /* If same EC already used in list, then redundant */
- foreach(lc, pathkeys)
- {
- PathKey *old_pathkey = (PathKey *) lfirst(lc);
-
- if (new_ec == old_pathkey->pk_eclass)
- return true;
- }
-
- return false;
+ return pathkey_is_unique(new_pathkey, pathkeys);
}
/*
@@ -1096,6 +1111,53 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
return pathkeys;
}
+/*
+ * make_pathkeys_for_distinctclauses
+ * Generate a pathkeys list for distinct clauses that represents the sort
+ * order specified by a list of SortGroupClauses. Similar to
+ * make_pathkeys_for_sortclauses, but allows the caller to specify whether
+ * to check for full redundancy, or just for uniqueness.
+ */
+List *
+make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *distinctclauses,
+ List *tlist, bool checkRedundant)
+{
+ List *pathkeys = NIL;
+ ListCell *l;
+
+ foreach(l, distinctclauses)
+ {
+ SortGroupClause *sortcl = (SortGroupClause *) lfirst(l);
+ Expr *sortkey;
+ PathKey *pathkey;
+
+ sortkey = (Expr *) get_sortgroupclause_expr(sortcl, tlist);
+ Assert(OidIsValid(sortcl->sortop));
+ pathkey = make_pathkey_from_sortop(root,
+ sortkey,
+ root->nullable_baserels,
+ sortcl->sortop,
+ sortcl->nulls_first,
+ sortcl->tleSortGroupRef,
+ true);
+
+ /* Canonical form eliminates redundant ordering keys */
+ if (checkRedundant)
+ {
+ if (!pathkey_is_redundant(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ else
+ {
+ if (!pathkey_is_unique(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ }
+ return pathkeys;
+}
+
+
/****************************************************************************
* PATHKEYS AND MERGECLAUSES
****************************************************************************/
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 608d5adfed..b578dcca20 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -180,7 +180,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2903,7 +2904,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -5179,7 +5181,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5194,6 +5197,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->skipPrefixSize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 9381939c82..ed52139839 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -505,6 +505,7 @@ minmax_qp_callback(PlannerInfo *root, void *extra)
root->group_pathkeys = NIL;
root->window_pathkeys = NIL;
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index cb897cc7f4..ed16b62dba 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3615,12 +3615,21 @@ standard_qp_callback(PlannerInfo *root, void *extra)
if (parse->distinctClause &&
grouping_is_sortable(parse->distinctClause))
+ {
+ root->uniq_distinct_pathkeys =
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, false);
root->distinct_pathkeys =
- make_pathkeys_for_sortclauses(root,
- parse->distinctClause,
- tlist);
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, true);
+ }
else
+ {
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
+ }
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
@@ -4802,11 +4811,63 @@ create_distinct_paths(PlannerInfo *root,
if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))
{
+ ListCell *lc;
+ IndexOptInfo *index = NULL;
+ bool differentColumnsOrder = false;
+ int i = 0;
+
add_path(distinct_rel, (Path *)
create_upper_unique_path(root, distinct_rel,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath))
+ index = ((IndexPath *) path)->indexinfo;
+ else
+ continue;
+
+ /*
+ * The order of columns in the index must be the same as for the
+ * unique distinct pathkeys, otherwise we cannot use _bt_search
+ * in the skip implementation - this could lead to missing
+ * records.
+ */
+ foreach(lc, root->uniq_distinct_pathkeys)
+ {
+ PathKey *pathKey = lfirst_node(PathKey, lc);
+ EquivalenceMember *em =
+ lfirst_node(EquivalenceMember,
+ list_head(pathKey->pk_eclass->ec_members));
+ Var *var = (Var *) em->em_expr;
+
+ Assert(i < index->ncolumns);
+
+ if (index->indexkeys[i] != var->varattno)
+ {
+ differentColumnsOrder = true;
+ break;
+ }
+
+ i++;
+ }
+
+ if (path->pathtype == T_IndexOnlyScan &&
+ enable_indexskipscan &&
+ index->amcanskip &&
+ root->distinct_pathkeys != NIL &&
+ !differentColumnsOrder)
+ {
+ int distinctPrefixKeys =
+ list_length(root->uniq_distinct_pathkeys);
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ distinctPrefixKeys,
+ numDistinctRows));
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d884d2bb00..df9b57215f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2928,6 +2928,46 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *basepath,
+ int distinctPrefixKeys,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(distinctPrefixKeys > 0);
+ /*Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));*/
+ pathnode->indexskipprefix = distinctPrefixKeys;
+
+ /*
+ * The cost of skipping to each distinct value should be roughly the
+ * same as the cost of finding the first key, so take that times the
+ * number of distinct values we expect to find.
+ */
+ pathnode->path.startup_cost = basepath->startup_cost;
+ pathnode->path.total_cost = basepath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 2405acbf6f..0f0bdad2ac 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -269,6 +269,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1208eb9a68..007c8ac14e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -912,6 +912,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5ee5e09ddf..99facc8f50 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6e3db06eed..34033c5486 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -130,6 +130,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -225,6 +229,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 8c053be2ca..2e79098b85 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -173,6 +173,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a3583f225b..247cdb8127 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -663,6 +663,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTScanInsert skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -777,6 +780,7 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -801,6 +805,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 99b9fa414f..01611fd411 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1405,6 +1405,8 @@ typedef struct IndexScanState
* ScanDesc index scan descriptor
* TableSlot slot for holding tuples fetched from the table
* VMBuffer buffer in use for visibility map testing, if any
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* PscanLen size of parallel index-only scan descriptor
* ----------------
*/
@@ -1424,6 +1426,8 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
TupleTableSlot *ioss_TableSlot;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 4b7703d478..e571c84473 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -298,6 +298,9 @@ struct PlannerInfo
List *group_pathkeys; /* groupClause pathkeys, if any */
List *window_pathkeys; /* pathkeys of bottom window, if any */
List *distinct_pathkeys; /* distinctClause pathkeys, if any */
+ List *uniq_distinct_pathkeys; /* unique, but possibly redundant,
+ distinctClause pathkeys,
+ if any */
List *sort_pathkeys; /* sortClause pathkeys, if any */
List *part_schemes; /* Canonicalised partition schemes used in the
@@ -829,6 +832,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1165,6 +1169,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1177,6 +1184,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 70f8b8e22b..b5b7d62b70 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -432,6 +432,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int skipPrefixSize; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9b6bdbc518..ad28c7f54a 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e70d6a3f18..fa461201a7 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -202,6 +202,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137d1d..a782d12a50 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -209,6 +209,10 @@ extern List *build_join_pathkeys(PlannerInfo *root,
extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
List *sortclauses,
List *tlist);
+extern List *make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist,
+ bool checkRedundant);
extern void initialize_mergeclause_eclasses(PlannerInfo *root,
RestrictInfo *restrictinfo);
extern void update_mergeclause_eclasses(PlannerInfo *root,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index c30e6738ba..91a8af3416 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..17a017dc4b 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,179 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+-- check columns order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ four | ten
+------+-----
+ 0 | 0
+ 0 | 2
+ 0 | 4
+ 0 | 6
+ 0 | 8
+(5 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+------------------------------------------------------
+ Index Only Scan using tenk1_four_ten on public.tenk1
+ Output: four
+ Scan mode: Skip scan
+ Index Cond: (tenk1.ten = 2)
+(4 rows)
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ QUERY PLAN
+------------------------------------------------------
+ Index Only Scan using tenk1_four_ten on public.tenk1
+ Output: four, ten
+ Scan mode: Skip scan
+ Index Cond: (tenk1.four = 0)
+(4 rows)
+
+DROP INDEX tenk1_four_ten;
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ four | ten
+------+-----
+ 0 | 2
+ 2 | 2
+(2 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+------------------------------------------------------------
+ Unique
+ Output: four
+ -> Index Only Scan using tenk1_ten_four on public.tenk1
+ Output: four
+ Index Cond: (tenk1.ten = 2)
+(5 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+------------------------------------------------------------
+ Unique
+ Output: four, ten
+ -> Index Only Scan using tenk1_ten_four on public.tenk1
+ Output: four, ten
+ Index Cond: (tenk1.ten = 2)
+(5 rows)
+
+DROP INDEX tenk1_ten_four;
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+ four | four
+------+------
+ 0 | 0
+ 2 | 2
+(2 rows)
+
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+ four | ?column?
+------+----------
+ 2 | 1
+ 0 | 1
+(2 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+--------------------------------------------------
+ Index Only Scan using tenk1_four on public.tenk1
+ Output: four
+ Scan mode: Skip scan
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+FETCH FROM c;
+ four
+------
+ 0
+(1 row)
+
+FETCH BACKWARD FROM c;
+ four
+------
+(0 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+END;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index cd46f071bd..04760639a8 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..ab7e7bd53c 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,58 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+
+-- check columns order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+DROP INDEX tenk1_four_ten;
+
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+DROP INDEX tenk1_ten_four;
+
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+END;
--
2.21.0
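To make the intended executor behavior concrete, here is a rough model in plain Python (illustrative only, not derived from the actual nbtree code) of what the skip scan does for DISTINCT over a prefix of index columns: after each emitted tuple, it skips past every remaining tuple that shares the current prefix.

```python
# Illustrative model of the skip scan executor path: the index is a
# sorted list of tuples, and "skipping" advances past every tuple that
# shares the current prefix, as the amskip()/_bt_skip call would.

def skip_scan(sorted_tuples, prefix_size):
    i = 0
    while i < len(sorted_tuples):
        tup = sorted_tuples[i]
        yield tup
        prefix = tup[:prefix_size]
        # Skip every tuple with the same prefix (the amskip() step).
        while (i < len(sorted_tuples)
               and sorted_tuples[i][:prefix_size] == prefix):
            i += 1

# DISTINCT ON (a) over an index on (a, b): one tuple per value of a.
tuples = [(a, b) for a in range(3) for b in range(4)]  # sorted by (a, b)
print(list(skip_scan(tuples, 1)))  # [(0, 0), (1, 0), (2, 0)]
```

The real patch additionally has to cope with backwards scans and cursors, which is exactly where the ioss_FirstTupleEmitted reset in IndexOnlyNext matters.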
I've previously noted upthread (along with several others) that I
don't see a good reason to limit this new capability to index only
scans. In addition to the reasons upthread, this also prevents using
the new feature on physical replicas, since index only scans rely on
visibility map information (IIRC) that isn't safe to make assumptions
about on a replica.
That being said, it strikes me that this likely indicates an existing
architectural issue. I was discussing the problem at PGCon with Andres
and Heikki with respect to an index scan variation I've been working on
myself. In short, it's not clear to me why we want index only scans
and index scans to be entirely separate nodes, rather than optional
variations within a broader index scan node. The problem becomes even
clearer as we continue to add additional variants that lie on
different axes, since we end up with an ever-multiplying number of
combinations.
In that discussion no one could remember why it'd been done that way,
but I'm planning to try to find the relevant threads in the archives
to see if there's anything in particular blocking combining them.
I generally dislike gating improvements like this on seemingly
tangentially related refactors, but I will make the observation that
adding the skip scan on top of such a refactored index scan node would
make this a much more obvious and complete win.
As I noted to Jesper at PGCon I'm happy to review the code in detail
also, but likely won't get to it until later this week or next week at
the earliest.
Jesper: Is there anything still on your list of things to change about
the patch? Or would now be a good time to look hard at the code?
James Coleman
On Fri, 14 Jun 2019 at 04:32, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
Here is a rebased version.
Hi Jesper,
I read over this thread a few weeks ago while travelling back from
PGCon. (I wish I'd read it on the outward trip instead since it would
have been good to talk about it in person.)
First off, I think this is a pretty great feature. It certainly seems
worth working on.
I've looked over the patch just to get a feel for how the planner part
works and I have a few ideas to share.
The code in create_distinct_paths() I think should work a different
way. I think it would be much better to add a new field to Path and
allow a path to know what keys it is distinct for. This sort of goes
back to an idea I thought about when developing unique joins
(9c7f5229ad) about an easier way to detect fields that a relation is
unique for. I've been calling these "UniqueKeys" in a few emails [1].
The idea was to tag these onto RelOptInfo to mention which columns or
exprs a relation is unique by so that we didn't continuously need to
look at unique indexes in all the places that call
relation_has_unique_index_for(). The idea there was that unique joins
would know when a join was unable to duplicate rows. If the outer side
of a join didn't duplicate the inner side, then the join RelOptInfo
could keep the UniqueKeys from the inner rel, and vice-versa. If both
didn't duplicate then the join rel would obtain the UniqueKeys from
both sides of the join. The idea here is that this would be a better
way to detect unique joins, and also when it came to the grouping
planner we'd also know if the distinct or group by should be a no-op.
DISTINCT could be skipped, and GROUP BY could do a group aggregate
without any sort.
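As a purely illustrative sketch of that propagation rule (hypothetical names, not taken from any PostgreSQL code):

```python
# Sketch of UniqueKeys propagation through a join: if one side of the
# join cannot duplicate the other side's rows, the join result keeps
# that other side's unique keys; if neither side duplicates, it keeps
# both. Each *_keys argument is a set of frozensets of column names on
# which the relation is known to be unique.

def join_unique_keys(outer_keys, inner_keys,
                     inner_unique_on_join_cols,
                     outer_unique_on_join_cols):
    result = set()
    if inner_unique_on_join_cols:
        # At most one inner match per outer row: outer rows survive
        # un-duplicated, so the outer side's uniqueness is preserved.
        result |= outer_keys
    if outer_unique_on_join_cols:
        result |= inner_keys
    return result

# t1 unique on (a), joined to t2 on t2.x where t2.x is unique: t1's
# rows cannot be duplicated, so the join is still unique on t1.a.
keys = join_unique_keys({frozenset({"t1.a"})}, {frozenset({"t2.x"})},
                        inner_unique_on_join_cols=True,
                        outer_unique_on_join_cols=False)
assert keys == {frozenset({"t1.a"})}
```

With the join rel carrying such a set, a DISTINCT or GROUP BY whose keys are covered by it becomes a no-op, as described above.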
I think these UniqueKeys ties into this work, perhaps not adding
UniqueKeys to RelOptInfo, but just to Path so that we create paths
that have UniqueKeys during create_index_paths() based on some
uniquekeys that are stored in PlannerInfo, similar to how we create
index paths in build_index_paths() by checking if the index
has_useful_pathkeys(). Doing it this way would open up more
opportunities to use skip scans. For example, semi-joins and
anti-joins could make use of them if the uniquekeys covered the entire
join condition. With this idea, the code you've added in
create_distinct_paths() can just search for the cheapest path that has
the correct uniquekeys, or if none exist then just do the normal
sort->unique or hash agg. I'm not entirely certain how we'd instruct
a semi/anti joined relation to build such paths, but that seems like a
problem that could be dealt with when someone does the work to allow
skip scans to be used for those.
Also, I'm not entirely sure that these UniqueKeys should make use of
PathKey since there's no real need to know about pk_opfamily,
pk_strategy, pk_nulls_first as those all just describe how the keys
are ordered. We just need to know if they're distinct or not. All
that's left after removing those fields is pk_eclass, so could
UniqueKeys just be a list of EquivalenceClass? or perhaps even a
Bitmapset with indexes into PlannerInfo->ec_classes (that might be
premature for now since we've not yet got
https://commitfest.postgresql.org/23/1984/ or
https://commitfest.postgresql.org/23/2019/ ). However, if we did use
PathKey, that does allow us to quickly check if the UniqueKeys are
contained within the PathKeys, since pathkeys are canonical, which
allows us just to compare their memory addresses to know if two are
equal. However, if you're storing eclasses, we could probably get the
same just by comparing the address of the eclass to the pathkey's
pk_eclass.
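For what it's worth, the pointer-comparison idea can be sketched like this (toy Python, with identity comparison standing in for comparing canonical pointers; none of these names are the real planner API):

```python
# Toy sketch (invented classes, not the planner API): because equivalence
# classes and pathkeys are canonical in the planner, "same object" stands in
# for "same pointer", so containment checks reduce to identity comparisons.
class EquivalenceClass:
    def __init__(self, name):
        self.name = name

class PathKey:
    def __init__(self, eclass):
        self.pk_eclass = eclass  # ordering info (opfamily, strategy) omitted

def unique_keys_contained_in_pathkeys(unique_keys, pathkeys):
    """unique_keys is a set of EquivalenceClass objects; compare by identity,
    mimicking a pointer comparison against each pathkey's pk_eclass."""
    pathkey_ecs = {id(pk.pk_eclass) for pk in pathkeys}
    return all(id(ec) in pathkey_ecs for ec in unique_keys)

a = EquivalenceClass("a")
b = EquivalenceClass("b")
pathkeys = [PathKey(a), PathKey(b)]
print(unique_keys_contained_in_pathkeys({a}, pathkeys))        # True
# A structurally identical but distinct object does not match, just as a
# non-canonical eclass pointer would not:
print(unique_keys_contained_in_pathkeys({EquivalenceClass("a")}, pathkeys))  # False
```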
Otherwise, I think how you're making use of paths in
create_distinct_paths() and create_skipscan_unique_path() kind of
contradicts how they're meant to be used.
I also agree with James that this should not be limited to Index Only
Scans. From testing the patch, the following seems pretty strange to
me:
# create table abc (a int, b int, c int);
CREATE TABLE
# insert into abc select a,b,1 from generate_Series(1,1000) a,
generate_Series(1,1000) b;
INSERT 0 1000000
# create index on abc(a,b);
CREATE INDEX
# explain analyze select distinct on (a) a,b from abc order by a,b; --
this is fast.
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Index Only Scan using abc_a_b_idx on abc (cost=0.42..85.00 rows=200
width=8) (actual time=0.260..20.518 rows=1000 loops=1)
Scan mode: Skip scan
Heap Fetches: 1000
Planning Time: 5.616 ms
Execution Time: 21.791 ms
(5 rows)
# explain analyze select distinct on (a) a,b,c from abc order by a,b;
-- Add one more column and it's slow.
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.42..50104.43 rows=200 width=12) (actual
time=1.201..555.280 rows=1000 loops=1)
-> Index Scan using abc_a_b_idx on abc (cost=0.42..47604.43
rows=1000000 width=12) (actual time=1.197..447.683 rows=1000000
loops=1)
Planning Time: 0.102 ms
Execution Time: 555.407 ms
(4 rows)
[1]: https://www.postgresql.org/search/?m=1&q=uniquekeys&l=1&d=-1&s=r
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi James,
On 6/13/19 11:40 PM, James Coleman wrote:
I've previously noted upthread (along with several others), that I
don't see a good reason to limit this new capability to index only
scans. In addition to the reasons upthread, this also prevents using
the new feature on physical replicas since index only scans require
visibility map (IIRC) information that isn't safe to make assumptions
about on a replica.

That being said, it strikes me that this likely indicates an existing
architecture issue. I was discussing the problem at PGCon with Andres
and Heikki with respect to an index scan variation I've been working on
myself. In short, it's not clear to me why we want index only scans
and index scans to be entirely separate nodes, rather than optional
variations within a broader index scan node. The problem becomes even
more clear as we continue to add additional variants that lie on
different axes, since we end up with an ever-multiplying number of
combinations.

In that discussion no one could remember why it'd been done that way,
but I'm planning to try to find the relevant threads in the archives
to see if there's anything in particular blocking combining them.

I generally dislike gating improvements like this on seemingly
tangentially related refactors, but I will make the observation that
adding the skip scan on top of such a refactored index scan node would
make this a much more obvious and complete win.
Thanks for your feedback !
As I noted to Jesper at PGCon I'm happy to review the code in detail
also, but likely won't get to it until later this week or next week at
the earliest.

Jesper: Is there anything still on your list of things to change about
the patch? Or would now be a good time to look hard at the code?
It would be valuable to have test cases for your use-cases that work
now, or should work.
I revived Thomas' patch because it covered our use-cases and saw it as a
much needed feature.
Thanks again !
Best regards,
Jesper
Hi David,
On 6/14/19 3:19 AM, David Rowley wrote:
I read over this thread a few weeks ago while travelling back from
PGCon. (I wish I'd read it on the outward trip instead since it would
have been good to talk about it in person.)

First off. I think this is a pretty great feature. It certainly seems
worthwhile working on it.

...

Otherwise, I think how you're making use of paths in
create_distinct_paths() and create_skipscan_unique_path() kind of
contradicts how they're meant to be used.
Thank you very much for this feedback ! Will need to revise the patch
based on this.
I also agree with James that this should not be limited to Index Only
Scans. From testing the patch, the following seems pretty strange to
me:

...
Ok, understood.
I have put the CF entry into "Waiting on Author".
Best regards,
Jesper
On Fri, Jun 14, 2019 at 9:20 AM David Rowley <david.rowley@2ndquadrant.com> wrote:
The code in create_distinct_paths() I think should work a different
way. I think it would be much better to add a new field to Path and
allow a path to know what keys it is distinct for.

...

Also, I'm not entirely sure that these UniqueKeys should make use of
PathKey since there's no real need to know about pk_opfamily,
pk_strategy, pk_nulls_first as those all just describe how the keys
are ordered. We just need to know if they're distinct or not.

...
Interesting, thanks for sharing this.
I also agree with James that this should not be limited to Index Only
Scans. From testing the patch, the following seems pretty strange to
me:
...
explain analyze select distinct on (a) a,b from abc order by a,b;
explain analyze select distinct on (a) a,b,c from abc order by a,b;
...
Yes, but I believe this limitation is not intrinsic to the idea of the patch,
and the very same approach can be used for IndexScan in the second example.
I've already prepared a new version to enable it for IndexScan with minimal
modifications, just need to rebase it on top of the latest changes and then
can post it. Although there would still be some limitations, I guess (e.g. the
first thing I've stumbled upon is that an index scan with a filter wouldn't
work well, because qual checking for a filter happens after
ExecScanFetch).
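To illustrate the hazard with a toy simulation (plain Python, nothing to do with the actual executor code): if the skip happens before the filter qual is evaluated, rows that would have satisfied the filter get jumped over.

```python
# Toy simulation (not PostgreSQL code): index tuples ordered by (a, b).
rows = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 2)]

def skip_scan_first_per_prefix(rows):
    """Yield only the first tuple for each distinct value of the leading
    column, the way a skip scan jumps over duplicate prefixes."""
    seen = set()
    for a, b in rows:
        if a not in seen:
            seen.add(a)
            yield (a, b)

def correct_distinct_on(rows, pred):
    """SELECT DISTINCT ON (a) ... WHERE pred: filter first, then take the
    first qualifying tuple per value of a."""
    out, seen = [], set()
    for a, b in rows:
        if pred(a, b) and a not in seen:
            seen.add(a)
            out.append((a, b))
    return out

pred = lambda a, b: b == 2

correct = correct_distinct_on(rows, pred)
# Broken order of operations: skip first, apply the filter afterwards.
broken = [t for t in skip_scan_first_per_prefix(rows) if pred(*t)]

print(correct)  # [(1, 2), (2, 2), (3, 2)]
print(broken)   # [(3, 2)]  -- a=1 and a=2 vanished: their b=2 rows were skipped
```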
On Sun, Jun 16, 2019 at 5:03 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
I also agree with James that this should not be limited to Index Only
Scans.

...

Yes, but I believe this limitation is not intrinsic to the idea of the patch,
and the very same approach can be used for IndexScan in the second example.

...
Here is what I was talking about, a POC for integration with index scan. As for
the use of create_skipscan_unique_path and the suggested planner improvements,
I hope that together with Jesper we can come up with something soon.
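For reference, the core strategy of the patch (try the current page first, and only restart from the root when the next distinct value isn't there) can be sketched with a toy model where "pages" are fixed-size chunks of a sorted key list; this is an assumption-laden simulation, not nbtree code.

```python
import bisect

# Toy model of the search strategy (assumptions: sorted keys, fixed-size
# "pages"; nothing here is nbtree code). To advance past duplicates, first
# binary-search the current page; only when the next distinct value isn't on
# that page do we "restart from the root" (here: search the rest of the list).
PAGE_SIZE = 4

def distinct_keys(keys):
    result = []
    root_searches = 0
    pos, n = 0, len(keys)
    while pos < n:
        result.append(keys[pos])
        page_start = (pos // PAGE_SIZE) * PAGE_SIZE
        page_end = min(page_start + PAGE_SIZE, n)
        # try the current page first
        nxt = bisect.bisect_right(keys, keys[pos], pos, page_end)
        if nxt == page_end and page_end < n:
            # not found on this page: fall back to a search "from the root"
            root_searches += 1
            nxt = bisect.bisect_right(keys, keys[pos], page_end, n)
        pos = nxt
    return result, root_searches

print(distinct_keys([0] * 5 + [1] * 5 + [2] * 5))  # ([0, 1, 2], 3)
print(distinct_keys(list(range(8))))               # ([0, 1, 2, 3, 4, 5, 6, 7], 1)
```

With few distinct values almost every step needs the fallback but skips whole runs of duplicates; with many distinct values almost every step resolves on the current page, which is the trade-off the commit message describes.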
Attachments:
Attachment: v18-0001-Index-skip-scan.patch (application/octet-stream)
From f75d034b59ea3bc2db600b9c27827b761b620676 Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Sat, 15 Sep 2018 21:14:50 +0200
Subject: [PATCH v18] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan. To make it suitable both for situations with a
small number of distinct values and for those with a significant number
of distinct values, the following approach is taken: instead of
searching from the root for every value, we search first on the current
page, and then, if not found, continue searching from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Jesper Pedersen, and a bit adjusted by Dmitry Dolgov.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 16 ++
doc/src/sgml/indexam.sgml | 10 ++
doc/src/sgml/indices.sgml | 24 +++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 ++
src/backend/access/nbtree/nbtree.c | 12 ++
src/backend/access/nbtree/nbtsearch.c | 224 +++++++++++++++++++++++++-
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 24 +++
src/backend/executor/nodeIndexonlyscan.c | 22 +++
src/backend/executor/nodeIndexscan.c | 22 +++
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 3 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/path/pathkeys.c | 83 ++++++++--
src/backend/optimizer/plan/createplan.c | 20 ++-
src/backend/optimizer/plan/planagg.c | 1 +
src/backend/optimizer/plan/planner.c | 80 ++++++++-
src/backend/optimizer/util/pathnode.c | 40 +++++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 5 +
src/include/access/genam.h | 1 +
src/include/access/nbtree.h | 5 +
src/include/nodes/execnodes.h | 7 +
src/include/nodes/pathnodes.h | 8 +
src/include/nodes/plannodes.h | 2 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/include/optimizer/paths.h | 4 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 209 ++++++++++++++++++++++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 68 ++++++++
41 files changed, 918 insertions(+), 22 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index ee3bd56274..a88b730f2e 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..a1c8a1ea27 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4400,6 +4400,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index dd54c68802..c2eb296306 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -144,6 +144,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -687,6 +688,15 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan.
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 95c0a1926c..592149f10e 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1235,6 +1235,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When an index scan is used to retrieve the distinct values of a column,
+ it can be inefficient, since it has to scan all the equal values of a
+ key. In such cases the planner will consider applying the index skip
+ scan approach, which is based on the idea of a
+ <ulink url="https://wiki.postgresql.org/wiki/Loose_indexscan">
+ Loose index scan</ulink>. Rather than scanning all equal values of a key,
+ as soon as a new value is found, it will search for a larger value on the
+ same index page, and if not found, restart the search from the root.
+ This is much faster when the index has many equal keys.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ae7b729edd..233ea9e5ec 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -109,6 +109,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cf9699ad18..9817f34c34 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 470b121e7d..328c17f13a 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 5cc30dac42..019e330cff 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -82,6 +82,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index aefdd2916d..1c2def162c 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -730,6 +731,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 85e54ac44b..3e50abd6b0 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -131,6 +131,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -380,6 +381,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -447,6 +450,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c655dadb96..f60534e1fb 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -37,7 +37,10 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
+static inline void _bt_update_skip_scankeys(IndexScanDesc scan,
+ Relation indexRel);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir);
/*
* _bt_drop_lock_and_maybe_pin()
@@ -1380,6 +1383,184 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Relation indexRel = scan->indexRelation;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ so->skipScanKey->keysz = prefix;
+ so->skipScanKey->scantid = NULL;
+ }
+ else
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos) &&
+ _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {
+ bool keyFound = false;
+
+ LockBuffer(so->currPos.buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, so->currPos.buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(so->currPos.buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found scan key within the current page, so let's scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+ * number
+ */
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ if (ScanDirectionIsForward(dir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ else
+ {
+ /* For backward scan finding offnum is more involved. It is wrong to
+ * just use binary search, since we will find the last item from the
+ * sequence of equal items, and we need the first one. Otherwise e.g.
+ * backward cursor scan will return an incorrect value. */
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ /* One step back to find a previous value */
+ if (_bt_next(scan, dir))
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ /* And now find the last item from the sequence for the current
+ * value, with the intention to do OffsetNumberNext. As a result
+ * we end up on the first element of the sequence. */
+ if (_bt_scankey_within_page(scan, so->skipScanKey,
+ so->currPos.buf, dir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ else
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ }
+ else
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ else
+ offnum = OffsetNumberNext(offnum);
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
@@ -2249,3 +2430,44 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
so->numKilled = 0; /* just paranoia */
so->markItemIndex = -1; /* ditto */
}
+
+static inline void
+_bt_update_skip_scankeys(IndexScanDesc scan, Relation indexRel)
+{
+ TupleDesc itupdesc;
+ int indnkeyatts, i;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey scankeys = so->skipScanKey->scankeys;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+}
+
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir)
+{
+ OffsetNumber low, high, compare_offset;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ return _bt_compare(scan->indexRelation,
+ key, page, compare_offset) > compare_value;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 45472db147..dc151ecf09 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 92969636b7..19a504c312 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1363,6 +1363,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexScan *indexscan = (IndexScan *) plan;
+ if (indexscan->skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexscan->skipPrefixSize,
+ es);
+ }
+
ExplainIndexScanDetails(indexscan->indexid,
indexscan->indexorderdir,
es);
@@ -1373,6 +1381,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->skipPrefixSize,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1582,6 +1598,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
switch (nodeTag(plan))
{
case T_IndexScan:
+ if (((IndexScan *) plan)->skipPrefixSize > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
if (((IndexScan *) plan)->indexqualorig)
@@ -1595,6 +1615,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->skipPrefixSize > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 8a4d795d1a..15e2ff7b1b 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -115,6 +115,24 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_SkipPrefixSize > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_SkipPrefixSize))
+ {
+ /*
+ * Reached the end of the index. At this point currPos is invalidated,
+ * and we need to reset ioss_FirstTupleEmitted, since otherwise after
+ * going backwards, reaching the end of the index, and going forward
+ * again we would apply the skip again, which would be incorrect and
+ * lead to an extra skipped item.
+ */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -253,6 +271,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -503,6 +523,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->skipPrefixSize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index ac7aa81f67..752c077d8b 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -116,6 +116,7 @@ IndexNext(IndexScanState *node)
node->iss_NumOrderByKeys);
node->iss_ScanDesc = scandesc;
+ node->iss_ScanDesc->xs_want_itup = true;
/*
* If no run-time keys to calculate or they are ready, go ahead and
@@ -127,6 +128,24 @@ IndexNext(IndexScanState *node)
node->iss_OrderByKeys, node->iss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_SkipPrefixSize > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_SkipPrefixSize))
+ {
+ /*
+ * Reached the end of the index. At this point currPos is invalidated,
+ * and we need to reset ioss_FirstTupleEmitted, since otherwise after
+ * going backwards, reaching the end of the index, and going forward
+ * again we would apply the skip again, which would be incorrect and
+ * lead to an extra skipped item.
+ */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* ok, now that we have what we need, fetch the next tuple.
*/
@@ -149,6 +168,7 @@ IndexNext(IndexScanState *node)
}
}
+ node->ioss_FirstTupleEmitted = true;
return slot;
}
@@ -906,6 +926,8 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexScan;
+ indexstate->ioss_SkipPrefixSize = node->skipPrefixSize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 78deade89b..98d7107caa 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -490,6 +490,7 @@ _copyIndexScan(const IndexScan *from)
COPY_NODE_FIELD(indexorderbyorig);
COPY_NODE_FIELD(indexorderbyops);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(skipPrefixSize);
return newnode;
}
@@ -515,6 +516,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(skipPrefixSize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 8400dd319e..15200f2e3a 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -559,6 +559,7 @@ _outIndexScan(StringInfo str, const IndexScan *node)
WRITE_NODE_FIELD(indexorderbyorig);
WRITE_NODE_FIELD(indexorderbyops);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(skipPrefixSize);
}
static void
@@ -573,6 +574,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(skipPrefixSize);
}
static void
@@ -2208,6 +2210,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
WRITE_NODE_FIELD(group_pathkeys);
WRITE_NODE_FIELD(window_pathkeys);
WRITE_NODE_FIELD(distinct_pathkeys);
+ WRITE_NODE_FIELD(uniq_distinct_pathkeys);
WRITE_NODE_FIELD(sort_pathkeys);
WRITE_NODE_FIELD(processed_tlist);
WRITE_NODE_FIELD(minmax_aggs);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 6c2626ee62..3652b244dd 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1787,6 +1787,7 @@ _readIndexScan(void)
READ_NODE_FIELD(indexorderbyorig);
READ_NODE_FIELD(indexorderbyops);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(skipPrefixSize);
READ_DONE();
}
@@ -1806,6 +1807,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(skipPrefixSize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a2a9b1f7be..6e0fe90e5c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 08b5061612..5e96058ab9 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -94,6 +94,30 @@ make_canonical_pathkey(PlannerInfo *root,
return pk;
}
+/*
+ * pathkey_is_unique
+ *    The part of pathkey_is_redundant that checks whether the new
+ *    pathkey's equivalence class is the same as that of any existing
+ *    member of the pathkey list.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+ EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+ ListCell *lc;
+
+ /* If same EC already used in list, then redundant */
+ foreach(lc, pathkeys)
+ {
+ PathKey *old_pathkey = (PathKey *) lfirst(lc);
+
+ if (new_ec == old_pathkey->pk_eclass)
+ return true;
+ }
+
+ return false;
+}
+
/*
* pathkey_is_redundant
* Is a pathkey redundant with one already in the given list?
@@ -133,22 +157,12 @@ static bool
pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys)
{
EquivalenceClass *new_ec = new_pathkey->pk_eclass;
- ListCell *lc;
/* Check for EC containing a constant --- unconditionally redundant */
if (EC_MUST_BE_REDUNDANT(new_ec))
return true;
- /* If same EC already used in list, then redundant */
- foreach(lc, pathkeys)
- {
- PathKey *old_pathkey = (PathKey *) lfirst(lc);
-
- if (new_ec == old_pathkey->pk_eclass)
- return true;
- }
-
- return false;
+ return pathkey_is_unique(new_pathkey, pathkeys);
}
/*
@@ -1096,6 +1110,53 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
return pathkeys;
}
+/*
+ * make_pathkeys_for_distinctclauses
+ *    Generate a pathkeys list for DISTINCT clauses that represents the
+ *    sort order specified by a list of SortGroupClauses. Similar to
+ *    make_pathkeys_for_sortclauses, but lets the caller specify whether
+ *    to check full redundancy or only uniqueness.
+ */
+List *
+make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *distinctclauses,
+ List *tlist, bool checkRedundant)
+{
+ List *pathkeys = NIL;
+ ListCell *l;
+
+ foreach(l, distinctclauses)
+ {
+ SortGroupClause *sortcl = (SortGroupClause *) lfirst(l);
+ Expr *sortkey;
+ PathKey *pathkey;
+
+ sortkey = (Expr *) get_sortgroupclause_expr(sortcl, tlist);
+ Assert(OidIsValid(sortcl->sortop));
+ pathkey = make_pathkey_from_sortop(root,
+ sortkey,
+ root->nullable_baserels,
+ sortcl->sortop,
+ sortcl->nulls_first,
+ sortcl->tleSortGroupRef,
+ true);
+
+ /* Canonical form eliminates redundant ordering keys */
+ if (checkRedundant)
+ {
+ if (!pathkey_is_redundant(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ else
+ {
+ if (!pathkey_is_unique(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ }
+ return pathkeys;
+}
+
/****************************************************************************
* PATHKEYS AND MERGECLAUSES
****************************************************************************/
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 608d5adfed..e2c53a0d05 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -175,12 +175,14 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
Oid indexid, List *indexqual, List *indexqualorig,
List *indexorderby, List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2903,7 +2905,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -2914,7 +2917,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexorderbys,
indexorderbys,
indexorderbyops,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
copy_generic_path_info(&scan_plan->plan, &best_path->path);
@@ -5150,7 +5154,8 @@ make_indexscan(List *qptlist,
List *indexorderby,
List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexScan *node = makeNode(IndexScan);
Plan *plan = &node->scan.plan;
@@ -5167,6 +5172,7 @@ make_indexscan(List *qptlist,
node->indexorderbyorig = indexorderbyorig;
node->indexorderbyops = indexorderbyops;
node->indexorderdir = indexscandir;
+ node->skipPrefixSize = skipPrefixSize;
return node;
}
@@ -5179,7 +5185,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5194,6 +5201,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->skipPrefixSize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 9381939c82..ed52139839 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -505,6 +505,7 @@ minmax_qp_callback(PlannerInfo *root, void *extra)
root->group_pathkeys = NIL;
root->window_pathkeys = NIL;
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index cb897cc7f4..e5fb3351d1 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3615,12 +3615,21 @@ standard_qp_callback(PlannerInfo *root, void *extra)
if (parse->distinctClause &&
grouping_is_sortable(parse->distinctClause))
+ {
+ root->uniq_distinct_pathkeys =
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, false);
root->distinct_pathkeys =
- make_pathkeys_for_sortclauses(root,
- parse->distinctClause,
- tlist);
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, true);
+ }
else
+ {
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
+ }
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
@@ -4802,11 +4811,76 @@ create_distinct_paths(PlannerInfo *root,
if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))
{
+ ListCell *lc;
+ IndexOptInfo *index = NULL;
+ bool different_columns_order = false,
+ not_empty_qual = false;
+ int i = 0;
+
add_path(distinct_rel, (Path *)
create_upper_unique_path(root, distinct_rel,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Also consider a skip scan, if possible. */
+ if (IsA(path, IndexPath))
+ index = ((IndexPath *) path)->indexinfo;
+ else
+ continue;
+
+ /*
+ * The order of columns in the index must match that of the unique
+ * distinct pathkeys; otherwise we cannot use _bt_search in the skip
+ * implementation, which could lead to missing records.
+ */
+ foreach(lc, root->uniq_distinct_pathkeys)
+ {
+ PathKey *pathKey = lfirst_node(PathKey, lc);
+ EquivalenceMember *em =
+ lfirst_node(EquivalenceMember,
+ list_head(pathKey->pk_eclass->ec_members));
+ Var *var = (Var *) em->em_expr;
+
+ Assert(i < index->ncolumns);
+
+ if (index->indexkeys[i] != var->varattno)
+ {
+ different_columns_order = true;
+ break;
+ }
+
+ i++;
+ }
+
+ /*
+ * XXX: In the case of an index scan, qual evaluation happens after
+ * ExecScanFetch, which means the skipped-to results could be filtered out
+ */
+ if (path->pathtype == T_IndexScan &&
+ parse->jointree != NULL &&
+ parse->jointree->quals != NULL &&
+ list_length((List *) parse->jointree->quals) != 0)
+ not_empty_qual = true;
+
+ if ((path->pathtype == T_IndexOnlyScan ||
+ path->pathtype == T_IndexScan) &&
+ enable_indexskipscan &&
+ index->amcanskip &&
+ root->distinct_pathkeys != NIL &&
+ !different_columns_order &&
+ !not_empty_qual)
+ {
+ int distinctPrefixKeys =
+ list_length(root->uniq_distinct_pathkeys);
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ distinctPrefixKeys,
+ numDistinctRows));
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d884d2bb00..df9b57215f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2928,6 +2928,46 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode that is the same as an existing IndexPath except
+ * that it skips duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *basepath,
+ int distinctPrefixKeys,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(distinctPrefixKeys > 0);
+ /*Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));*/
+ pathnode->indexskipprefix = distinctPrefixKeys;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = basepath->startup_cost;
+ pathnode->path.total_cost = basepath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 40f497660d..8c05b3bb5c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -269,6 +269,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1208eb9a68..007c8ac14e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -912,6 +912,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5ee5e09ddf..99facc8f50 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6e3db06eed..34033c5486 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -130,6 +130,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -225,6 +229,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 8c053be2ca..2e79098b85 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -173,6 +173,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a3583f225b..247cdb8127 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -663,6 +663,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTScanInsert skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -777,6 +780,7 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -801,6 +805,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 99b9fa414f..08096ec68b 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1377,6 +1377,8 @@ typedef struct IndexScanState
ExprContext *iss_RuntimeContext;
Relation iss_RelationDesc;
struct IndexScanDescData *iss_ScanDesc;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
/* These are needed for re-checking ORDER BY expr ordering */
pairingheap *iss_ReorderQueue;
@@ -1406,6 +1408,9 @@ typedef struct IndexScanState
* TableSlot slot for holding tuples fetched from the table
* VMBuffer buffer in use for visibility map testing, if any
* PscanLen size of parallel index-only scan descriptor
+ * SkipPrefixSize number of distinct prefix keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted yet?
* ----------------
*/
typedef struct IndexOnlyScanState
@@ -1424,6 +1429,8 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
TupleTableSlot *ioss_TableSlot;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 4b7703d478..e571c84473 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -298,6 +298,9 @@ struct PlannerInfo
List *group_pathkeys; /* groupClause pathkeys, if any */
List *window_pathkeys; /* pathkeys of bottom window, if any */
List *distinct_pathkeys; /* distinctClause pathkeys, if any */
+ List *uniq_distinct_pathkeys; /* unique, but not necessarily
non-redundant, distinctClause
pathkeys, if any */
List *sort_pathkeys; /* sortClause pathkeys, if any */
List *part_schemes; /* Canonicalised partition schemes used in the
@@ -829,6 +832,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1165,6 +1169,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1177,6 +1184,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 70f8b8e22b..4cdee103ec 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -405,6 +405,7 @@ typedef struct IndexScan
List *indexorderbyorig; /* the same in original form */
List *indexorderbyops; /* OIDs of sort ops for ORDER BY exprs */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int skipPrefixSize; /* the size of the prefix for distinct scans */
} IndexScan;
/* ----------------
@@ -432,6 +433,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int skipPrefixSize; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9b6bdbc518..ad28c7f54a 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e70d6a3f18..fa461201a7 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -202,6 +202,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137d1d..a782d12a50 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -209,6 +209,10 @@ extern List *build_join_pathkeys(PlannerInfo *root,
extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
List *sortclauses,
List *tlist);
+extern List *make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist,
+ bool checkRedundant);
extern void initialize_mergeclause_eclasses(PlannerInfo *root,
RestrictInfo *restrictinfo);
extern void update_mergeclause_eclasses(PlannerInfo *root,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index c30e6738ba..91a8af3416 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..e79b9e8274 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,212 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ four | ten
+------+-----
+ 0 | 0
+ 1 | 9
+ 2 | 0
+ 3 | 1
+(4 rows)
+
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ four | ten
+------+-----
+ 1 | 9
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ QUERY PLAN
+--------------------------------------
+ Index Scan using tenk1_four on tenk1
+ Scan mode: Skip scan
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ QUERY PLAN
+---------------------------------------------------
+ Result
+ -> Unique
+ -> Bitmap Heap Scan on tenk1
+ Recheck Cond: (four = 1)
+ -> Bitmap Index Scan on tenk1_four
+ Index Cond: (four = 1)
+(6 rows)
+
+-- check column order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ four | ten
+------+-----
+ 0 | 0
+ 0 | 2
+ 0 | 4
+ 0 | 6
+ 0 | 8
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Scan mode: Skip scan
+ Index Cond: (ten = 2)
+(3 rows)
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Scan mode: Skip scan
+ Index Cond: (four = 0)
+(3 rows)
+
+DROP INDEX tenk1_four_ten;
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ four | ten
+------+-----
+ 0 | 2
+ 2 | 2
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------------
+ Unique
+ -> Index Only Scan using tenk1_ten_four on tenk1
+ Index Cond: (ten = 2)
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------------
+ Unique
+ -> Index Only Scan using tenk1_ten_four on tenk1
+ Index Cond: (ten = 2)
+(3 rows)
+
+DROP INDEX tenk1_ten_four;
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+ four | four
+------+------
+ 0 | 0
+ 2 | 2
+(2 rows)
+
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+ four | ?column?
+------+----------
+ 2 | 1
+ 0 | 1
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+-------------------------------------------
+ Index Only Scan using tenk1_four on tenk1
+ Scan mode: Skip scan
+(2 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+FETCH FROM c;
+ four
+------
+ 0
+(1 row)
+
+FETCH BACKWARD FROM c;
+ four
+------
+(0 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+END;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index cd46f071bd..04760639a8 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..bbe1978305 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,71 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+-- check columns order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+DROP INDEX tenk1_four_ten;
+
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+DROP INDEX tenk1_ten_four;
+
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+END;
--
2.16.4
Hi,
On 6/19/19 9:57 AM, Dmitry Dolgov wrote:
Here is what I was talking about, POC for an integration with index scan. About
using of create_skipscan_unique_path and suggested planner improvements, I hope
together with Jesper we can come up with something soon.
I made some minor changes, but I did move all the code in
create_distinct_paths() under enable_indexskipscan to limit the overhead
if skip scan isn't enabled.
Attached is v20, since the last patch should have been v19.
Best regards,
Jesper
Attachments:
v20-0001-Index-skip-scan.patch (text/x-patch)
From 4fd4bd601f510ccce858196c0e93d37aa2d0f20f Mon Sep 17 00:00:00 2001
From: jesperpedersen <jesper.pedersen@redhat.com>
Date: Thu, 20 Jun 2019 07:42:24 -0400
Subject: [PATCH 1/2] Implementation of Index Skip Scan (see Loose Index Scan
in the wiki [1]) on top of IndexScan and IndexOnlyScan. To make it suitable
both when there is a small number of distinct values and when there is a
large number of distinct values, the following approach is taken: instead
of searching from the root for every value we're looking for, we first
search on the current page, and only if the value is not found there do we
continue the search from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
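As a toy illustration of that strategy (a hypothetical Python model, not patch code: the "page" is just a slice of a sorted key list, and the names are invented), the skip step first binary-searches the current page and only falls back to a full re-search when the next distinct value is not on that page:

```python
from bisect import bisect_right

def skip_to_next_distinct(keys, page_bounds, pos):
    """Return the index of the first key greater than keys[pos].

    keys models the sorted leaf-level key sequence; page_bounds is a
    (lo, hi) slice standing in for the current page.  Try the cheap
    within-page search first; only when the page ends with the same
    key do we re-search the whole sequence (the "descent from the
    root").
    """
    cur = keys[pos]
    lo, hi = page_bounds
    if keys[hi - 1] > cur:                  # next value is on this "page"
        return lo + bisect_right(keys, cur, lo, hi) - lo
    return bisect_right(keys, cur)          # restart "from the root"

keys = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
distinct = []
pos = 0
while pos < len(keys):
    distinct.append(keys[pos])
    pos = skip_to_next_distinct(keys,
                                (max(0, pos - 2), min(len(keys), pos + 3)),
                                pos)
print(distinct)  # -> [0, 1, 2, 3]
```

The point of the two-tier search is that with few distinct values the root descents dominate (and are rare), while with many distinct values the next value is usually on the same page and the descent is skipped entirely.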
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 16 ++
doc/src/sgml/indexam.sgml | 10 +
doc/src/sgml/indices.sgml | 24 ++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 ++
src/backend/access/nbtree/nbtree.c | 12 +
src/backend/access/nbtree/nbtsearch.c | 224 +++++++++++++++++-
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 24 ++
src/backend/executor/nodeIndexonlyscan.c | 22 ++
src/backend/executor/nodeIndexscan.c | 22 ++
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 3 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/path/pathkeys.c | 84 ++++++-
src/backend/optimizer/plan/createplan.c | 20 +-
src/backend/optimizer/plan/planagg.c | 1 +
src/backend/optimizer/plan/planner.c | 79 +++++-
src/backend/optimizer/util/pathnode.c | 40 ++++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 5 +
src/include/access/genam.h | 1 +
src/include/access/nbtree.h | 5 +
src/include/nodes/execnodes.h | 7 +
src/include/nodes/pathnodes.h | 8 +
src/include/nodes/plannodes.h | 2 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/include/optimizer/paths.h | 4 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 209 ++++++++++++++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 68 ++++++
41 files changed, 918 insertions(+), 22 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index ee3bd56274..a88b730f2e 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..a1c8a1ea27 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4400,6 +4400,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). This parameter requires
+ that <varname>enable_indexonlyscan</varname> is <literal>on</literal>.
+ The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index dd54c68802..c2eb296306 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -144,6 +144,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -687,6 +688,15 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan.
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
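The amskip contract documented above can be sketched like this (an illustrative Python model; `index_skip` here is an invented stand-in operating on a sorted list of tuples, not the C function):

```python
def index_skip(tuples, pos, prefix, forward=True):
    """Advance past every tuple whose first `prefix` columns equal
    those of tuples[pos]; return the new position, or None when the
    end of the (sorted) tuple list is reached in that direction."""
    key = tuples[pos][:prefix]
    step = 1 if forward else -1
    while 0 <= pos < len(tuples) and tuples[pos][:prefix] == key:
        pos += step
    return pos if 0 <= pos < len(tuples) else None

# A tiny two-column "index" on (four, ten), kept sorted:
idx = [(0, 0), (0, 2), (1, 1), (1, 9), (2, 0)]
print(index_skip(idx, 0, prefix=1))  # 2: first tuple with a new leading value
print(index_skip(idx, 2, prefix=1))  # 4
print(index_skip(idx, 4, prefix=1))  # None: end of index
```

Note that the direction argument matters: a backward skip must land on the other side of the run of equal prefixes, which is exactly the subtlety the backward-scan branch in `_bt_skip` deals with.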
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 95c0a1926c..592149f10e 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1235,6 +1235,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+   When an index scan is used to retrieve the distinct values of a column,
+   it can be inefficient, since it has to step over all the equal values
+   of a key.  In such cases the planner will consider applying the index
+   skip scan approach, which is based on the idea of a
+   <ulink url="https://wiki.postgresql.org/wiki/Loose_indexscan">
+   Loose index scan</ulink>.  Rather than scanning all equal values of a
+   key, as soon as a new value is found, it will search for a larger value
+   on the same index page, and if none is found, restart the search from
+   the root.  This is much faster when the index has many equal keys.
+ </para>
+ </sect2>
</sect1>
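To see why this helps with many duplicate keys, here is a rough back-of-envelope model (hypothetical numbers, not a benchmark): a plain index scan reads all n entries, while a skip scan pays roughly one root-to-leaf descent per distinct value.

```python
import math

def plain_scan_reads(n, d):
    """A full scan visits every index entry."""
    return n

def skip_scan_reads(n, d, fanout=256):
    """Assume one root-to-leaf descent (tree-height pages) per
    distinct value; fanout is a guessed keys-per-page figure."""
    height = max(1, math.ceil(math.log(n, fanout)))
    return d * height

n, d = 10_000_000, 3            # numbers from the example upthread
print(plain_scan_reads(n, d))   # 10000000
print(skip_scan_reads(n, d))    # 9: three values, three-level descent each
```

This matches the shape of the numbers in the first message of the thread: 44248 shared buffers for the sequential scan versus 13 for the skip scan.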
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ae7b729edd..233ea9e5ec 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -109,6 +109,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cf9699ad18..9817f34c34 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 470b121e7d..328c17f13a 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 5cc30dac42..019e330cff 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -82,6 +82,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index aefdd2916d..1c2def162c 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -730,6 +731,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 85e54ac44b..3e50abd6b0 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -131,6 +131,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -380,6 +381,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -447,6 +450,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c655dadb96..f60534e1fb 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -37,7 +37,10 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
+static inline void _bt_update_skip_scankeys(IndexScanDesc scan,
+ Relation indexRel);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir);
/*
* _bt_drop_lock_and_maybe_pin()
@@ -1380,6 +1383,184 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple. The current position is set so that a subsequent call
+ * to _bt_next will fetch the first tuple that differs in the leading 'prefix'
+ * keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Relation indexRel = scan->indexRelation;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ so->skipScanKey->keysz = prefix;
+ so->skipScanKey->scantid = NULL;
+ }
+ else
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos) &&
+ _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {
+ bool keyFound = false;
+
+ LockBuffer(so->currPos.buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, so->currPos.buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(so->currPos.buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found the scan key within the current page, so let's scan
+ * from the root.  Use _bt_search and _bt_binsrch to get the buffer and
+ * offset number.
+ */
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ if (ScanDirectionIsForward(dir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ else
+ {
+ /*
+ * For a backward scan, finding offnum is more involved.  It is wrong
+ * to just use binary search, since that finds the last item in a
+ * sequence of equal items, while we need the first one; otherwise
+ * e.g. a backward cursor scan would return an incorrect value.
+ */
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ /* One step back to find a previous value */
+ if (_bt_next(scan, dir))
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ /*
+ * And now find the last item of the sequence for the current
+ * value, with the intention of doing OffsetNumberNext.  As a
+ * result we end up on the first element of the sequence.
+ */
+ if (_bt_scankey_within_page(scan, so->skipScanKey,
+ so->currPos.buf, dir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ else
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ }
+ else
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ else
+ offnum = OffsetNumberNext(offnum);
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
@@ -2249,3 +2430,44 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
so->numKilled = 0; /* just paranoia */
so->markItemIndex = -1; /* ditto */
}
+
+static inline void
+_bt_update_skip_scankeys(IndexScanDesc scan, Relation indexRel)
+{
+ TupleDesc itupdesc;
+ int indnkeyatts, i;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey scankeys = so->skipScanKey->scankeys;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+}
+
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir)
+{
+ OffsetNumber low, high, compare_offset;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ return _bt_compare(scan->indexRelation,
+ key, page, compare_offset) > compare_value;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 45472db147..dc151ecf09 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 92969636b7..19a504c312 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1363,6 +1363,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexScan *indexscan = (IndexScan *) plan;
+ if (indexscan->skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexscan->skipPrefixSize,
+ es);
+ }
+
ExplainIndexScanDetails(indexscan->indexid,
indexscan->indexorderdir,
es);
@@ -1373,6 +1381,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL,
+ indexonlyscan->skipPrefixSize,
+ es);
+ }
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1582,6 +1598,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
switch (nodeTag(plan))
{
case T_IndexScan:
+ if (((IndexScan *) plan)->skipPrefixSize > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
if (((IndexScan *) plan)->indexqualorig)
@@ -1595,6 +1615,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->skipPrefixSize > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 8a4d795d1a..15e2ff7b1b 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -115,6 +115,24 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_SkipPrefixSize > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_SkipPrefixSize))
+ {
+ /*
+ * Reached the end of the index.  At this point currPos is
+ * invalidated, and we need to reset ioss_FirstTupleEmitted,
+ * since otherwise, after going backwards, reaching the end of
+ * the index, and going forwards again, we would apply the skip
+ * once more, incorrectly skipping an extra item.
+ */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -253,6 +271,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -503,6 +523,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->skipPrefixSize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index ac7aa81f67..752c077d8b 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -116,6 +116,7 @@ IndexNext(IndexScanState *node)
node->iss_NumOrderByKeys);
node->iss_ScanDesc = scandesc;
+ node->iss_ScanDesc->xs_want_itup = true;
/*
* If no run-time keys to calculate or they are ready, go ahead and
@@ -127,6 +128,24 @@ IndexNext(IndexScanState *node)
node->iss_OrderByKeys, node->iss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_SkipPrefixSize > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_SkipPrefixSize))
+ {
+ /*
+ * Reached the end of the index.  At this point currPos is
+ * invalidated, and we need to reset ioss_FirstTupleEmitted,
+ * since otherwise, after going backwards, reaching the end of
+ * the index, and going forwards again, we would apply the skip
+ * once more, incorrectly skipping an extra item.
+ */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* ok, now that we have what we need, fetch the next tuple.
*/
@@ -149,6 +168,7 @@ IndexNext(IndexScanState *node)
}
}
+ node->ioss_FirstTupleEmitted = true;
return slot;
}
@@ -906,6 +926,8 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexScan;
+ indexstate->ioss_SkipPrefixSize = node->skipPrefixSize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 78deade89b..98d7107caa 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -490,6 +490,7 @@ _copyIndexScan(const IndexScan *from)
COPY_NODE_FIELD(indexorderbyorig);
COPY_NODE_FIELD(indexorderbyops);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(skipPrefixSize);
return newnode;
}
@@ -515,6 +516,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(skipPrefixSize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 8400dd319e..15200f2e3a 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -559,6 +559,7 @@ _outIndexScan(StringInfo str, const IndexScan *node)
WRITE_NODE_FIELD(indexorderbyorig);
WRITE_NODE_FIELD(indexorderbyops);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(skipPrefixSize);
}
static void
@@ -573,6 +574,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(skipPrefixSize);
}
static void
@@ -2208,6 +2210,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
WRITE_NODE_FIELD(group_pathkeys);
WRITE_NODE_FIELD(window_pathkeys);
WRITE_NODE_FIELD(distinct_pathkeys);
+ WRITE_NODE_FIELD(uniq_distinct_pathkeys);
WRITE_NODE_FIELD(sort_pathkeys);
WRITE_NODE_FIELD(processed_tlist);
WRITE_NODE_FIELD(minmax_aggs);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 6c2626ee62..3652b244dd 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1787,6 +1787,7 @@ _readIndexScan(void)
READ_NODE_FIELD(indexorderbyorig);
READ_NODE_FIELD(indexorderbyops);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(skipPrefixSize);
READ_DONE();
}
@@ -1806,6 +1807,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(skipPrefixSize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a2a9b1f7be..6e0fe90e5c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 08b5061612..af7d9c4270 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
#include "utils/lsyscache.h"
+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
RelOptInfo *partrel,
@@ -94,6 +95,30 @@ make_canonical_pathkey(PlannerInfo *root,
return pk;
}
+/*
+ * pathkey_is_unique
+ * Part of pathkey_is_redundant that is responsible for the case when the
+ * new pathkey's equivalence class is the same as that of any existing
+ * member of the pathkey list.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+ EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+ ListCell *lc;
+
+ /* If same EC already used in list, then redundant */
+ foreach(lc, pathkeys)
+ {
+ PathKey *old_pathkey = (PathKey *) lfirst(lc);
+
+ if (new_ec == old_pathkey->pk_eclass)
+ return true;
+ }
+
+ return false;
+}
+
/*
* pathkey_is_redundant
* Is a pathkey redundant with one already in the given list?
@@ -133,22 +158,12 @@ static bool
pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys)
{
EquivalenceClass *new_ec = new_pathkey->pk_eclass;
- ListCell *lc;
/* Check for EC containing a constant --- unconditionally redundant */
if (EC_MUST_BE_REDUNDANT(new_ec))
return true;
- /* If same EC already used in list, then redundant */
- foreach(lc, pathkeys)
- {
- PathKey *old_pathkey = (PathKey *) lfirst(lc);
-
- if (new_ec == old_pathkey->pk_eclass)
- return true;
- }
-
- return false;
+ return pathkey_is_unique(new_pathkey, pathkeys);
}
/*
@@ -1096,6 +1111,53 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
return pathkeys;
}
+/*
+ * make_pathkeys_for_distinctclauses
+ * Generate a pathkeys list for distinct clauses that represents the sort
+ * order specified by a list of SortGroupClauses. Similar to
+ * make_pathkeys_for_sortclauses, but allows the caller to specify whether
+ * to check for full redundancy or just uniqueness.
+ */
+List *
+make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *distinctclauses,
+ List *tlist, bool checkRedundant)
+{
+ List *pathkeys = NIL;
+ ListCell *l;
+
+ foreach(l, distinctclauses)
+ {
+ SortGroupClause *sortcl = (SortGroupClause *) lfirst(l);
+ Expr *sortkey;
+ PathKey *pathkey;
+
+ sortkey = (Expr *) get_sortgroupclause_expr(sortcl, tlist);
+ Assert(OidIsValid(sortcl->sortop));
+ pathkey = make_pathkey_from_sortop(root,
+ sortkey,
+ root->nullable_baserels,
+ sortcl->sortop,
+ sortcl->nulls_first,
+ sortcl->tleSortGroupRef,
+ true);
+
+ /* Canonical form eliminates redundant ordering keys */
+ if (checkRedundant)
+ {
+ if (!pathkey_is_redundant(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ else
+ {
+ if (!pathkey_is_unique(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ }
+ return pathkeys;
+}
+
+
/****************************************************************************
* PATHKEYS AND MERGECLAUSES
****************************************************************************/
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 608d5adfed..e2c53a0d05 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -175,12 +175,14 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
Oid indexid, List *indexqual, List *indexqualorig,
List *indexorderby, List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2903,7 +2905,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -2914,7 +2917,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexorderbys,
indexorderbys,
indexorderbyops,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
copy_generic_path_info(&scan_plan->plan, &best_path->path);
@@ -5150,7 +5154,8 @@ make_indexscan(List *qptlist,
List *indexorderby,
List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexScan *node = makeNode(IndexScan);
Plan *plan = &node->scan.plan;
@@ -5167,6 +5172,7 @@ make_indexscan(List *qptlist,
node->indexorderbyorig = indexorderbyorig;
node->indexorderbyops = indexorderbyops;
node->indexorderdir = indexscandir;
+ node->skipPrefixSize = skipPrefixSize;
return node;
}
@@ -5179,7 +5185,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5194,6 +5201,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->skipPrefixSize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 9381939c82..ed52139839 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -505,6 +505,7 @@ minmax_qp_callback(PlannerInfo *root, void *extra)
root->group_pathkeys = NIL;
root->window_pathkeys = NIL;
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index cb897cc7f4..663be21597 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3615,12 +3615,21 @@ standard_qp_callback(PlannerInfo *root, void *extra)
if (parse->distinctClause &&
grouping_is_sortable(parse->distinctClause))
+ {
+ root->uniq_distinct_pathkeys =
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, false);
root->distinct_pathkeys =
- make_pathkeys_for_sortclauses(root,
- parse->distinctClause,
- tlist);
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, true);
+ }
else
+ {
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
+ }
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
@@ -4807,6 +4816,70 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Consider index skip scan as well */
+ if (enable_indexskipscan &&
+ IsA(path, IndexPath) &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ (path->pathtype == T_IndexOnlyScan ||
+ path->pathtype == T_IndexScan) &&
+ root->distinct_pathkeys != NIL)
+ {
+ ListCell *lc;
+ IndexOptInfo *index = NULL;
+ bool different_columns_order = false,
+ not_empty_qual = false;
+ int i = 0;
+
+ index = ((IndexPath *) path)->indexinfo;
+
+ /*
+ * The order of columns in the index must be the same as for the
+ * unique distinct pathkeys; otherwise we cannot use _bt_search
+ * in the skip implementation, which could lead to missing
+ * records.
+ */
+ foreach(lc, root->uniq_distinct_pathkeys)
+ {
+ PathKey *pathKey = lfirst_node(PathKey, lc);
+ EquivalenceMember *em =
+ lfirst_node(EquivalenceMember,
+ list_head(pathKey->pk_eclass->ec_members));
+ Var *var = (Var *) em->em_expr;
+
+ Assert(i < index->ncolumns);
+
+ if (index->indexkeys[i] != var->varattno)
+ {
+ different_columns_order = true;
+ break;
+ }
+
+ i++;
+ }
+
+ /*
+ * XXX: In the index scan case, quals evaluation happens after
+ * ExecScanFetch, which means skip results could be filtered out
+ */
+ if (path->pathtype == T_IndexScan &&
+ parse->jointree != NULL &&
+ parse->jointree->quals != NULL &&
+ ((List *)parse->jointree->quals)->length != 0)
+ not_empty_qual = true;
+
+ if (!different_columns_order && !not_empty_qual)
+ {
+ int distinctPrefixKeys =
+ list_length(root->uniq_distinct_pathkeys);
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ distinctPrefixKeys,
+ numDistinctRows));
+ }
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d884d2bb00..df9b57215f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2928,6 +2928,46 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *basepath,
+ int distinctPrefixKeys,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(distinctPrefixKeys > 0);
+ /*Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));*/
+ pathnode->indexskipprefix = distinctPrefixKeys;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = basepath->startup_cost;
+ pathnode->path.total_cost = basepath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 40f497660d..8c05b3bb5c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -269,6 +269,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1208eb9a68..007c8ac14e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -912,6 +912,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5ee5e09ddf..99facc8f50 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6e3db06eed..34033c5486 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -130,6 +130,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -225,6 +229,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 8c053be2ca..2e79098b85 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -173,6 +173,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a3583f225b..247cdb8127 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -663,6 +663,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTScanInsert skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -777,6 +780,7 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -801,6 +805,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 99b9fa414f..08096ec68b 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1377,6 +1377,8 @@ typedef struct IndexScanState
ExprContext *iss_RuntimeContext;
Relation iss_RelationDesc;
struct IndexScanDescData *iss_ScanDesc;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
/* These are needed for re-checking ORDER BY expr ordering */
pairingheap *iss_ReorderQueue;
@@ -1406,6 +1408,9 @@ typedef struct IndexScanState
* TableSlot slot for holding tuples fetched from the table
* VMBuffer buffer in use for visibility map testing, if any
* PscanLen size of parallel index-only scan descriptor
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ----------------
*/
typedef struct IndexOnlyScanState
@@ -1424,6 +1429,8 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
TupleTableSlot *ioss_TableSlot;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 4b7703d478..e571c84473 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -298,6 +298,9 @@ struct PlannerInfo
List *group_pathkeys; /* groupClause pathkeys, if any */
List *window_pathkeys; /* pathkeys of bottom window, if any */
List *distinct_pathkeys; /* distinctClause pathkeys, if any */
+ List *uniq_distinct_pathkeys; /* unique, but not necessarily
non-redundant, distinctClause pathkeys,
if any */
List *sort_pathkeys; /* sortClause pathkeys, if any */
List *part_schemes; /* Canonicalised partition schemes used in the
@@ -829,6 +832,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1165,6 +1169,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1177,6 +1184,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 70f8b8e22b..4cdee103ec 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -405,6 +405,7 @@ typedef struct IndexScan
List *indexorderbyorig; /* the same in original form */
List *indexorderbyops; /* OIDs of sort ops for ORDER BY exprs */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int skipPrefixSize; /* the size of the prefix for distinct scans */
} IndexScan;
/* ----------------
@@ -432,6 +433,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int skipPrefixSize; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9b6bdbc518..ad28c7f54a 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e70d6a3f18..fa461201a7 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -202,6 +202,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137d1d..a782d12a50 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -209,6 +209,10 @@ extern List *build_join_pathkeys(PlannerInfo *root,
extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
List *sortclauses,
List *tlist);
+extern List *make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist,
+ bool checkRedundant);
extern void initialize_mergeclause_eclasses(PlannerInfo *root,
RestrictInfo *restrictinfo);
extern void update_mergeclause_eclasses(PlannerInfo *root,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 5305b53cac..056de928fe 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..e79b9e8274 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,212 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ four | ten
+------+-----
+ 0 | 0
+ 1 | 9
+ 2 | 0
+ 3 | 1
+(4 rows)
+
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ four | ten
+------+-----
+ 1 | 9
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ QUERY PLAN
+--------------------------------------
+ Index Scan using tenk1_four on tenk1
+ Scan mode: Skip scan
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ QUERY PLAN
+---------------------------------------------------
+ Result
+ -> Unique
+ -> Bitmap Heap Scan on tenk1
+ Recheck Cond: (four = 1)
+ -> Bitmap Index Scan on tenk1_four
+ Index Cond: (four = 1)
+(6 rows)
+
+-- check column order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ four | ten
+------+-----
+ 0 | 0
+ 0 | 2
+ 0 | 4
+ 0 | 6
+ 0 | 8
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Scan mode: Skip scan
+ Index Cond: (ten = 2)
+(3 rows)
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Scan mode: Skip scan
+ Index Cond: (four = 0)
+(3 rows)
+
+DROP INDEX tenk1_four_ten;
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ four | ten
+------+-----
+ 0 | 2
+ 2 | 2
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------------
+ Unique
+ -> Index Only Scan using tenk1_ten_four on tenk1
+ Index Cond: (ten = 2)
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------------
+ Unique
+ -> Index Only Scan using tenk1_ten_four on tenk1
+ Index Cond: (ten = 2)
+(3 rows)
+
+DROP INDEX tenk1_ten_four;
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+ four | four
+------+------
+ 0 | 0
+ 2 | 2
+(2 rows)
+
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+ four | ?column?
+------+----------
+ 2 | 1
+ 0 | 1
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+-------------------------------------------
+ Index Only Scan using tenk1_four on tenk1
+ Scan mode: Skip scan
+(2 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+FETCH FROM c;
+ four
+------
+ 0
+(1 row)
+
+FETCH BACKWARD FROM c;
+ four
+------
+(0 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+END;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index cd46f071bd..04760639a8 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..bbe1978305 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,71 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+-- check column order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+DROP INDEX tenk1_four_ten;
+
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+DROP INDEX tenk1_ten_four;
+
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+END;
--
2.21.0
The following SQL statement seems to return incorrect results; some logic in the backwards scan is currently not entirely right.
-Floris
drop table if exists a;
create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 10000) ks);
create index on a (a,b);
analyze a;
select distinct on (a) a,b from a order by a desc, b desc;
explain select distinct on (a) a,b from a order by a desc, b desc;
DROP TABLE
CREATE TABLE
INSERT 0 50000
CREATE INDEX
ANALYZE
a | b
---+-------
5 | 10000
5 | 1
4 | 1
3 | 1
2 | 1
1 | 1
(6 rows)
QUERY PLAN
---------------------------------------------------------------------------------
Index Only Scan Backward using a_a_b_idx on a (cost=0.29..1.45 rows=5 width=8)
Scan mode: Skip scan
(2 rows)
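As a reference point for what the backward skip scan should return here, the following is a toy Python model of a backward loose/skip scan over the (a, b) index from the reproduction above. This is only an illustrative sketch of the intended semantics, not the patch's C implementation: for each distinct `a` in descending order, the scan should emit exactly one tuple, the one with the maximum `b`.

```python
import bisect

# Illustrative model only -- not the patch's C implementation.
# Index on (a, b): 5 distinct values of a, 10000 values of b each,
# mirroring the table from the report above.
index = sorted((a, b) for a in range(1, 6) for b in range(1, 10001))

def backward_skip_scan(idx):
    """Yield the last (a, b) entry for each distinct prefix a, descending."""
    pos = len(idx)
    while pos > 0:
        tup = idx[pos - 1]              # rightmost tuple of current prefix
        yield tup
        # Skip backward past all remaining entries with this prefix value;
        # (a,) sorts before (a, b) for any b, so bisect_left lands on the
        # first entry of the current prefix.
        pos = bisect.bisect_left(idx, (tup[0],))

# DISTINCT ON (a) a, b ... ORDER BY a DESC, b DESC should give 5 rows,
# one per distinct a, each with the highest b:
assert list(backward_skip_scan(index)) == [(a, 10000) for a in range(5, 0, -1)]
```

The reported output above instead contains six rows, including a stray (5, 1), which is consistent with the scan stepping once too far after the first skip.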
On Sat, Jun 22, 2019 at 12:17 PM Floris Van Nee <florisvannee@optiver.com>
wrote:
The following sql statement seems to have incorrect results - some logic in
the backwards scan is currently not entirely right.
Thanks for testing! You're right: in the current implementation, a backward
scan takes one unnecessary extra step forward. It seems this mistake crept in
because I was concentrating only on backward scans with a cursor, and used a
not-quite-correct approach to wrap up after a scan was finished. Give me a
moment, I'll tighten it up.
Thanks, looking forward to it. I think I found some other strange behavior. Given the same table as in my previous e-mail, the following queries also return inconsistent results. I spent some time trying to debug it, but I can't easily pinpoint the cause. It looks like it also skips over one value too much; my guess is that this happens during the _bt_skippage call in _bt_skip?
Perhaps a question: when stepping through code in GDB, is there an easy way to pretty print, for example, the contents of an IndexTuple? I saw there are some tools out there that can pretty print plans, but viewing tuples is more complicated, I guess.
-- this one is OK
postgres=# select distinct on (a) a,b from a where b>1;
a | b
---+---
1 | 2
2 | 2
3 | 2
4 | 2
5 | 2
(5 rows)
-- this one is not OK, it seems to skip too much
postgres=# select distinct on (a) a,b from a where b=2;
a | b
---+---
1 | 2
3 | 2
5 | 2
(3 rows)
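To make the expected behavior concrete, here is a small Python sketch of a forward skip scan that applies the qual within each prefix group before skipping to the next group. Again, this is an illustrative model of the intended semantics, not the patch's C code; with b = 2 there is a matching row in every group, so all five values of a should come back:

```python
import bisect

# Illustrative model only -- not the patch's C implementation.
# Index on (a, b), same table as in the reports above.
index = sorted((a, b) for a in range(1, 6) for b in range(1, 10001))

def forward_skip_scan(idx, qual):
    """For each distinct prefix a, return the first entry satisfying qual."""
    results = []
    pos = 0
    while pos < len(idx):
        prefix = idx[pos][0]
        # Scan forward within the current prefix until the qual matches.
        while pos < len(idx) and idx[pos][0] == prefix:
            if qual(idx[pos]):
                results.append(idx[pos])
                break
            pos += 1
        # Skip past any remaining entries of this prefix to the next one.
        pos = bisect.bisect_right(idx, (prefix, float("inf")), lo=pos)
    return results

# DISTINCT ON (a) ... WHERE b = 2 should return one row per value of a:
assert forward_skip_scan(index, lambda t: t[1] == 2) == \
    [(a, 2) for a in range(1, 6)]
```

The report above instead shows only a = 1, 3, 5 coming back, consistent with the scan skipping one prefix group too many when the qual matches at a group boundary.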
On Sat, Jun 22, 2019 at 3:15 PM Floris Van Nee <florisvannee@optiver.com> wrote:
Perhaps a question: when stepping through code in GDB, is there an easy way to pretty print, for example, the contents of an IndexTuple? I saw there are some tools out there that can pretty print plans, but viewing tuples is more complicated, I guess.
Try the attached patch -- it has DEBUG1 traces with the contents of
index tuples from key points during index scans, allowing you to see
what's going on both as a B-Tree is descended and as a range scan is
performed. It also shows details of the insertion scankey that is set
up within _bt_first(). This hasn't been adapted to the skip scan patch
at all, so you'll probably need to do that.
The patch should be considered a very rough hack, for now. It leaks
memory like crazy. But I think that you'll find it helpful.
--
Peter Geoghegan
Attachments:
0012-Index-scan-positioning-DEBUG1-instrumentation.patch
From d6d85dc5c1160e4e6eba61543ac6f4e35c2c196f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 17 Jun 2019 15:42:19 -0700
Subject: [PATCH 12/12] Index scan positioning DEBUG1 instrumentation
---
src/backend/access/nbtree/nbtree.c | 41 ++++
src/backend/access/nbtree/nbtsearch.c | 277 +++++++++++++++++++++++++-
src/backend/utils/adt/selfuncs.c | 2 +
3 files changed, 319 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 85e54ac44b..bf0bec795c 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -22,6 +22,7 @@
#include "access/nbtxlog.h"
#include "access/relscan.h"
#include "access/xlog.h"
+#include "catalog/catalog.h"
#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
@@ -198,6 +199,10 @@ btinsert(Relation rel, Datum *values, bool *isnull,
bool result;
IndexTuple itup;
+ if (!IsCatalogRelation(rel))
+ elog(DEBUG1, "%s call to btinsert()",
+ RelationGetRelationName(rel));
+
/* generate an index tuple */
itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
itup->t_tid = *ht_ctid;
@@ -218,6 +223,10 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
+ if (!IsCatalogRelation(scan->indexRelation))
+ elog(DEBUG1, "%s call to btgettuple()",
+ RelationGetRelationName(scan->indexRelation));
+
/* btree indexes are never lossy */
scan->xs_recheck = false;
@@ -293,6 +302,10 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
int64 ntids = 0;
ItemPointer heapTid;
+ if (!IsCatalogRelation(scan->indexRelation))
+ elog(DEBUG1, "%s call to btgetbitmap()",
+ RelationGetRelationName(scan->indexRelation));
+
/*
* If we have any array keys, initialize them.
*/
@@ -350,6 +363,10 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
IndexScanDesc scan;
BTScanOpaque so;
+ if (!IsCatalogRelation(rel))
+ elog(DEBUG1, "%s call to btbeginscan()",
+ RelationGetRelationName(rel));
+
/* no order by operators allowed */
Assert(norderbys == 0);
@@ -396,6 +413,10 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ if (!IsCatalogRelation(scan->indexRelation))
+ elog(DEBUG1, "%s call to btrescan()",
+ RelationGetRelationName(scan->indexRelation));
+
/* we aren't holding any read locks, but gotta drop the pins */
if (BTScanPosIsValid(so->currPos))
{
@@ -455,6 +476,10 @@ btendscan(IndexScanDesc scan)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ if (!IsCatalogRelation(scan->indexRelation))
+ elog(DEBUG1, "%s call to btendscan()",
+ RelationGetRelationName(scan->indexRelation));
+
/* we aren't holding any read locks, but gotta drop the pins */
if (BTScanPosIsValid(so->currPos))
{
@@ -491,6 +516,10 @@ btmarkpos(IndexScanDesc scan)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ if (!IsCatalogRelation(scan->indexRelation))
+ elog(DEBUG1, "%s call to btmarkpos()",
+ RelationGetRelationName(scan->indexRelation));
+
/* There may be an old mark with a pin (but no lock). */
BTScanPosUnpinIfPinned(so->markPos);
@@ -521,6 +550,10 @@ btrestrpos(IndexScanDesc scan)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ if (!IsCatalogRelation(scan->indexRelation))
+ elog(DEBUG1, "%s call to btrestrpos()",
+ RelationGetRelationName(scan->indexRelation));
+
/* Restore the marked positions of any array keys */
if (so->numArrayKeys)
_bt_restore_array_keys(scan);
@@ -859,6 +892,10 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
Relation rel = info->index;
BTCycleId cycleid;
+ if (!IsCatalogRelation(rel))
+ elog(DEBUG1, "%s call to btbulkdelete()",
+ RelationGetRelationName(rel));
+
/* allocate stats if first time through, else re-use existing struct */
if (stats == NULL)
stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
@@ -900,6 +937,10 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
if (info->analyze_only)
return stats;
+ if (!IsCatalogRelation(info->index))
+ elog(DEBUG1, "%s call to btvacuumcleanup()",
+ RelationGetRelationName(info->index));
+
/*
* If btbulkdelete was called, we need not do anything, just return the
* stats from the latest btbulkdelete call. If it wasn't called, we might
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c655dadb96..e480d56ff2 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -17,6 +17,7 @@
#include "access/nbtree.h"
#include "access/relscan.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/predicate.h"
@@ -67,6 +68,63 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
}
}
+static void
+print_itup(BlockNumber blk, IndexTuple left, IndexTuple right, Relation rel,
+ char *extra)
+{
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values[INDEX_MAX_KEYS];
+ char *lkey_desc = NULL;
+ char *rkey_desc;
+
+ /* Avoid infinite recursion -- don't instrument catalog indexes */
+ if (!IsCatalogRelation(rel))
+ {
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int natts;
+ int indnkeyatts = rel->rd_index->indnkeyatts;
+
+ natts = BTreeTupleGetNAtts(left, rel);
+ itupdesc->natts = Min(indnkeyatts, natts);
+ memset(&isnull, 0xFF, sizeof(isnull));
+ index_deform_tuple(left, itupdesc, values, isnull);
+ rel->rd_index->indnkeyatts = natts;
+
+ /*
+ * Since the regression tests should pass when the instrumentation
+ * patch is applied, be prepared for BuildIndexValueDescription() to
+ * return NULL due to security considerations.
+ */
+ lkey_desc = BuildIndexValueDescription(rel, values, isnull);
+ if (lkey_desc && right)
+ {
+ /*
+ * Revolting hack: modify tuple descriptor to have number of key
+ * columns actually present in caller's pivot tuples
+ */
+ natts = BTreeTupleGetNAtts(right, rel);
+ itupdesc->natts = Min(indnkeyatts, natts);
+ memset(&isnull, 0xFF, sizeof(isnull));
+ index_deform_tuple(right, itupdesc, values, isnull);
+ rel->rd_index->indnkeyatts = natts;
+ rkey_desc = BuildIndexValueDescription(rel, values, isnull);
+ elog(DEBUG1, "%s blk %u sk > %s, sk <= %s %s",
+ RelationGetRelationName(rel), blk, lkey_desc, rkey_desc,
+ extra);
+ pfree(rkey_desc);
+ }
+ else
+ elog(DEBUG1, "%s blk %u sk check %s %s",
+ RelationGetRelationName(rel), blk, lkey_desc, extra);
+
+ /* Cleanup */
+ itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+ rel->rd_index->indnkeyatts = indnkeyatts;
+ if (lkey_desc)
+ pfree(lkey_desc);
+ }
+}
+
/*
* _bt_search() -- Search the tree for a particular scankey,
* or more precisely for the first leaf page it could be on.
@@ -113,6 +171,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
BlockNumber blkno;
BlockNumber par_blkno;
BTStack new_stack;
+ IndexTuple right;
/*
* Race -- the page we just grabbed may have split since we read its
@@ -142,6 +201,40 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
offnum = _bt_binsrch(rel, key, *bufP);
itemid = PageGetItemId(page, offnum);
itup = (IndexTuple) PageGetItem(page, itemid);
+
+ /*
+ * Every downlink is between two separator keys, provided you pretend
+ * that even rightmost pages have a positive infinity high key. The
+ * key to the left of the downlink is a strict lower bound for items
+ * that can be found by following the downlink, whereas the right
+ * separator is a <= bound.
+ */
+ if (offnum == PageGetMaxOffsetNumber(page))
+ {
+ /*
+ * XXX: This is correct even on rightmost page, since "high key"
+ * position item will be negative infinity item, which is printed
+ * blank. If you assume that even rightmost pages have a positive
+ * infinity high key (and don't expect the instrumentation of the
+ * tuple to say either positive or negative infinity) then it
+ * makes sense.
+ *
+ * An internal page with only one downlink is rare though possible
+ * (see comments above _bt_binsrch()). Note that even in that
+ * case there are two separators (positive and negative infinity).
+ */
+ itemid = PageGetItemId(page, P_HIKEY);
+ right = (IndexTuple) PageGetItem(page, itemid);
+ print_itup(BufferGetBlockNumber(*bufP), itup, right, rel,
+ "(<= separator is high key)");
+ }
+ else if (OffsetNumberNext(offnum) <= PageGetMaxOffsetNumber(page))
+ {
+ itemid = PageGetItemId(page, OffsetNumberNext(offnum));
+ right = (IndexTuple) PageGetItem(page, itemid);
+ print_itup(BufferGetBlockNumber(*bufP), itup, right, rel, "");
+ }
+
blkno = BTreeInnerTupleGetDownLink(itup);
par_blkno = BufferGetBlockNumber(*bufP);
@@ -274,6 +367,9 @@ _bt_moveright(Relation rel,
for (;;)
{
+ IndexTuple hikey;
+ ItemId itemid;
+
page = BufferGetPage(buf);
TestForOldSnapshot(snapshot, rel, page);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -305,14 +401,40 @@ _bt_moveright(Relation rel,
continue;
}
- if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+ if (P_IGNORE(opaque))
{
/* step right one page */
+ elog(DEBUG1, "%s blk %u must move right because page is ignorable",
+ RelationGetRelationName(rel), BufferGetBlockNumber(buf));
+ buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+ continue;
+ }
+ else if (_bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+ {
+ /*
+ * Very unlikely to catch this -- repeated moving right at same
+ * point in index suggests corruption masked by moving right
+ */
+ itemid = PageGetItemId(page, P_HIKEY);
+ hikey = (IndexTuple) PageGetItem(page, itemid);
+ print_itup(BufferGetBlockNumber(buf), hikey, NULL, rel,
+ "high key move right");
+ /* step right one page */
buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
continue;
}
else
+ {
+ /*
+ * No need to move right (common case), but report that to be
+ * consistent
+ */
+ itemid = PageGetItemId(page, P_HIKEY);
+ hikey = (IndexTuple) PageGetItem(page, itemid);
+ print_itup(BufferGetBlockNumber(buf), hikey, NULL, rel,
+ "high key no move right");
break;
+ }
}
if (P_IGNORE(opaque))
@@ -778,7 +900,12 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* never be satisfied (eg, x == 1 AND x > 2).
*/
if (!so->qual_ok)
+ {
+ if (!IsCatalogRelation(scan->indexRelation))
+ elog(DEBUG1, "%s preprocessing determined that keys are contradictory",
+ RelationGetRelationName(scan->indexRelation));
return false;
+ }
/*
* For parallel scans, get the starting page from shared state. If the
@@ -788,6 +915,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
*/
if (scan->parallel_scan != NULL)
{
+ /* XXX: No instrumentation for parallel scans */
status = _bt_parallel_seize(scan, &blkno);
if (!status)
return false;
@@ -986,6 +1114,15 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
{
bool match;
+ if (!IsCatalogRelation(scan->indexRelation))
+ elog(DEBUG1, "%s sk could not be formed, so descending to %s leaf page in whole index",
+ RelationGetRelationName(scan->indexRelation),
+ ScanDirectionIsForward(dir) ? "leftmost" : "rightmost");
+
+ /*
+ * Note that _bt_endpoint() will call _bt_readpage() -- it will be
+ * called, though not from usual place
+ */
match = _bt_endpoint(scan, dir);
if (!match)
@@ -1008,6 +1145,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
for (i = 0; i < keysCount; i++)
{
ScanKey cur = startKeys[i];
+ Oid typOutput;
+ bool varlenatype;
+ char *val;
Assert(cur->sk_attno == i + 1);
@@ -1032,6 +1172,27 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
+ /* Report row comparison header's first insertion scankey entry */
+ if (!IsCatalogRelation(rel))
+ {
+ if (subkey->sk_subtype != InvalidOid)
+ getTypeOutputInfo(subkey->sk_subtype,
+ &typOutput, &varlenatype);
+ else
+ getTypeOutputInfo(rel->rd_opcintype[i],
+ &typOutput, &varlenatype);
+ val = OidOutputFunctionCall(typOutput, subkey->sk_argument);
+ if (val)
+ {
+ elog(DEBUG1, "%s sk subkey attr %d val: %s (%s, %s)",
+ RelationGetRelationName(rel), subkey->sk_attno, val,
+ (subkey->sk_flags & SK_BT_NULLS_FIRST) != 0 ? "NULLS FIRST" : "NULLS LAST",
+ (subkey->sk_flags & SK_BT_DESC) != 0 ? "DESC" : "ASC");
+
+ pfree(val);
+ }
+ }
+
/*
* If the row comparison is the last positioning key we accepted,
* try to add additional keys from the lower-order row members.
@@ -1065,6 +1226,32 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
memcpy(inskey.scankeys + keysCount, subkey,
sizeof(ScanKeyData));
keysCount++;
+
+ /*
+ * The need to separately report additional
+ * subkeys-as-insertion-scankey is an artifact of the way
+ * extra keys are added here more or less as an
+ * opportunistic optimization used when row comparison is
+ * the last positioning key.
+ */
+ if (!IsCatalogRelation(rel))
+ {
+ if (subkey->sk_subtype != InvalidOid)
+ getTypeOutputInfo(subkey->sk_subtype,
+ &typOutput, &varlenatype);
+ else
+ getTypeOutputInfo(rel->rd_opcintype[i],
+ &typOutput, &varlenatype);
+ val = OidOutputFunctionCall(typOutput, subkey->sk_argument);
+ if (val)
+ {
+ elog(DEBUG1, "%s sk extra subkey attr %d val: %s (%s, %s)",
+ RelationGetRelationName(rel), subkey->sk_attno, val,
+ (subkey->sk_flags & SK_BT_NULLS_FIRST) != 0 ? "NULLS FIRST" : "NULLS LAST",
+ (subkey->sk_flags & SK_BT_DESC) != 0 ? "DESC" : "ASC");
+ pfree(val);
+ }
+ }
if (subkey->sk_flags & SK_ROW_END)
{
used_all_subkeys = true;
@@ -1139,6 +1326,38 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
cmp_proc,
cur->sk_argument);
}
+
+ /*
+ * Most index scans have insertion scan key entries reported here
+ */
+ if (!IsCatalogRelation(rel))
+ {
+ if (!(cur->sk_flags & SK_ISNULL))
+ {
+ if (cur->sk_subtype != InvalidOid)
+ getTypeOutputInfo(cur->sk_subtype,
+ &typOutput, &varlenatype);
+ else
+ getTypeOutputInfo(rel->rd_opcintype[i],
+ &typOutput, &varlenatype);
+ val = OidOutputFunctionCall(typOutput, cur->sk_argument);
+ if (val)
+ {
+ elog(DEBUG1, "%s sk attr %d val: %s (%s, %s)",
+ RelationGetRelationName(rel), i, val,
+ (cur->sk_flags & SK_BT_NULLS_FIRST) != 0 ? "NULLS FIRST" : "NULLS LAST",
+ (cur->sk_flags & SK_BT_DESC) != 0 ? "DESC" : "ASC");
+ pfree(val);
+ }
+ }
+ else
+ {
+ elog(DEBUG1, "%s sk attr %d val: NULL (%s, %s)",
+ RelationGetRelationName(rel), i,
+ (cur->sk_flags & SK_BT_NULLS_FIRST) != 0 ? "NULLS FIRST" : "NULLS LAST",
+ (cur->sk_flags & SK_BT_DESC) != 0 ? "DESC" : "ASC");
+ }
+ }
}
}
@@ -1241,6 +1460,12 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
inskey.scantid = NULL;
inskey.keysz = keysCount;
+ /* Report additional insertion scan key details */
+ if (!IsCatalogRelation(rel))
+ elog(DEBUG1, "%s searching tree with %d keys, nextkey=%d, goback=%d",
+ RelationGetRelationName(rel), inskey.keysz, inskey.nextkey,
+ goback);
+
/*
* Use the manufactured insertion scan key to descend the tree and
* position ourselves on the target leaf page.
@@ -1297,6 +1522,13 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
if (goback)
offnum = OffsetNumberPrev(offnum);
+ /* Report _bt_readpage()'s starting offset */
+ if (!IsCatalogRelation(rel))
+ elog(DEBUG1, "%s blk %u initial leaf page offset is %u out of %lu",
+ RelationGetRelationName(rel),
+ BufferGetBlockNumber(buf), offnum,
+ PageGetMaxOffsetNumber(BufferGetPage(buf)));
+
/* remember which buffer we have pinned, if any */
Assert(!BTScanPosIsValid(so->currPos));
so->currPos.buf = buf;
@@ -1500,6 +1732,12 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
offnum = OffsetNumberNext(offnum);
}
+ /* Report which offset/item terminated index scan */
+ if (!continuescan && !IsCatalogRelation(scan->indexRelation))
+ elog(DEBUG1, "%s blk %u non-pivot offnum %u ended forward scan",
+ RelationGetRelationName(scan->indexRelation),
+ BufferGetBlockNumber(so->currPos.buf), offnum);
+
/*
* We don't need to visit page to the right when the high key
* indicates that no more matches will be found there.
@@ -1519,6 +1757,31 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
_bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+
+ /*
+ * Report if high key check was effective. This is also a
+ * reasonably useful way of indicating the progress of a large
+ * range scan that visits many leaf pages.
+ */
+ if (continuescan)
+ print_itup(BufferGetBlockNumber(so->currPos.buf), itup,
+ NULL, scan->indexRelation,
+ "continuescan high key check did not end forward scan");
+ else
+ print_itup(BufferGetBlockNumber(so->currPos.buf), itup,
+ NULL, scan->indexRelation,
+ "continuescan high key check ended forward scan");
+ }
+ else if (continuescan && P_RIGHTMOST(opaque) &&
+ !IsCatalogRelation(scan->indexRelation))
+ {
+ /*
+ * Report that range scan reached end of entire index -- this
+ * won't be caught by above non-pivot elog().
+ */
+ elog(DEBUG1, "%s blk %u non-pivot offnum %u (last in whole index) ended forward scan",
+ RelationGetRelationName(scan->indexRelation),
+ BufferGetBlockNumber(so->currPos.buf), offnum);
}
if (!continuescan)
@@ -1587,6 +1850,18 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
offnum = OffsetNumberPrev(offnum);
}
+ if (!IsCatalogRelation(scan->indexRelation))
+ {
+ if (!continuescan)
+ elog(DEBUG1, "%s blk %u non-pivot offnum %u ended backwards scan",
+ RelationGetRelationName(scan->indexRelation),
+ BufferGetBlockNumber(so->currPos.buf), offnum);
+ else
+ elog(DEBUG1, "%s blk %u backwards scan must continue to left sibling",
+ RelationGetRelationName(scan->indexRelation),
+ BufferGetBlockNumber(so->currPos.buf));
+ }
+
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d7e3f09f1a..dbcb2e2c27 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5126,6 +5126,7 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
/* No hope if no relation or it doesn't have indexes */
if (rel == NULL || rel->indexlist == NIL)
return false;
+ elog(DEBUG1, "*** Begin get_actual_variable_range() ***");
/* If it has indexes it must be a plain relation */
rte = root->simple_rte_array[rel->relid];
Assert(rte->rtekind == RTE_RELATION);
@@ -5336,6 +5337,7 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
}
}
+ elog(DEBUG1, "*** End get_actual_variable_range() ***");
return have_data;
}
--
2.17.1
Try the attached patch -- it has DEBUG1 traces with the contents of
index tuples from key points during index scans, allowing you to see
what's going on both as a B-Tree is descended, and as a range scan is
performed. It also shows details of the insertion scankey that is set
up within _bt_first(). This hasn't been adapted to the patch at all,
so you'll probably need to do that.
Thanks! That works quite nicely.
I've pinpointed the problem to within _bt_skip. I'll try to illustrate with my test case. The data in the table is (a,b)=(1,1), (1,2) ... (1,10000), (2, 1), (2,2), ... (2,10000) until (5,10000).
Running the query
SELECT DISTINCT ON (a) a,b FROM a WHERE b=2;
The flow is like this:
_bt_first is called first - it sees there are no suitable scan keys to start at a custom location in the tree, so it just starts from the beginning and searches until it finds the first tuple (1,2).
After the first tuple was yielded, _bt_skip kicks in. It constructs an insert scan key with a=1 and nextkey=true, so doing the _bt_search + _bt_binsrch on this, it finds the first tuple larger than this: (2,1). This is not the tuple that it stops at though, because after that it does this:
if (ScanDirectionIsForward(dir))
    /* Move back for _bt_next */
    offnum = OffsetNumberPrev(offnum);
....
/* Now read the data */
if (!_bt_readpage(scan, dir, offnum))
{
    /*
     * There's no actually-matching data on this page. Try to advance to
     * the next page. Return false if there's no matching data at all.
     */
    LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
    if (!_bt_steppage(scan, dir))
First, it takes the previous tuple with OffsetNumberPrev (so the tuple before (2,1), which is (1,10000)). This is done because, if this tuple were to be returned, there would be a call to _bt_next afterwards, which would then conveniently land on the tuple (2,1) that we want. However, _bt_readpage messes things up, because it only reads tuples that match all the provided keys (so where b=2). The only tuple it'll return is (2,2). This becomes the current tuple; however, on the call to _bt_next, the position is first incremented, so we'll find (2,3) there, which doesn't match our keys. This leads it to skip (2,2) in our result set.
I was wondering about something else: don't we also have another problem with updating the current index tuple by skipping before calling btgettuple/_bt_next? I see there's some code in btgettuple to kill dead tuples when scan->kill_prior_tuple is true. I'm not too familiar with the concept of killing dead tuples while doing index scans, but by looking at the code it seems possible that btgettuple returns a tuple, the caller processes it and sets kill_prior_tuple to true in order to have it killed. However, the skip scan then kicks in, which sets the current tuple to a completely different tuple. Then, on the next call of btgettuple, the wrong tuple gets killed. Is my reasoning correct here, or am I missing something?
-Floris
Hi,
I've done some initial review on v20 - just reading through the code, no
tests at this point. Here are my comments:
1) config.sgml
I'm not sure why the enable_indexskipscan section says
This parameter requires that <varname>enable_indexonlyscan</varname>
is <literal>on</literal>.
I suppose it's the same thing as for enable_indexscan, and we don't say
anything like that for that GUC.
2) indices.sgml
The new section is somewhat unclear and difficult to understand, I think
it'd deserve a rewording. Also, I wonder if we really should link to the
wiki page about FSM problems. We have a couple of wiki links in the sgml
docs, but those seem more generic, while this seems like a development page
that might disappear. But more importantly, that wiki page does not say
anything about "Loose Index scans" so is it even the right wiki page?
3) nbtsearch.c
_bt_skip - comments are formatted incorrectly
_bt_update_skip_scankeys - missing comment
_bt_scankey_within_page - missing comment
4) explain.c
There are duplicate blocks of code for IndexScan and IndexOnlyScan:
if (indexscan->skipPrefixSize > 0)
{
if (es->format != EXPLAIN_FORMAT_TEXT)
ExplainPropertyInteger("Distinct Prefix", NULL,
indexscan->skipPrefixSize,
es);
}
I suggest we wrap this into a function ExplainIndexSkipScanKeys() or
something like that.
Also, there's this:
if (((IndexScan *) plan)->skipPrefixSize > 0)
{
ExplainPropertyText("Scan mode", "Skip scan", es);
}
That does not make much sense - there's just a single 'scan mode' value.
So I suggest we do the same thing as for unique joins, i.e.
ExplainPropertyBool("Skip Scan",
(((IndexScan *) plan)->skipPrefixSize > 0),
es);
5) nodeIndexOnlyScan.c
In ExecInitIndexOnlyScan, we should initialize the ioss_ fields a bit
later, with the existing ones. This is just cosmetic issue, though.
6) nodeIndexScan.c
I wonder why we even add and initialize the ioss_ fields for IndexScan
nodes, when the skip scans require index-only scans?
7) pathnode.c
I wonder how much was the costing discussed. It seems to me the logic is
fairly similar to ideas discussed in the incremental sort patch, and
we've been discussing some weak points there. I'm not sure how much we
need to consider those issues here.
8) execnodes.h
The comment before IndexScanState mentions a new field, NumDistinctKeys,
but no such field is added to the struct.
9) pathnodes.h
I don't understand what the uniq_distinct_pathkeys comment says :-(
10) plannodes.h
The naming of the new field (skipPrefixSize) added to IndexScan and
IndexOnlyScan is clearly inconsistent with the naming convention of the
existing fields.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Jun 23, 2019 at 1:04 PM Floris Van Nee <florisvannee@optiver.com> wrote:
However, _bt_readpage messes things up, because it only reads tuples that
match all the provided keys (so where b=2)
Right, the problem you've reported first had similar origins. I'm starting to
think that using _bt_readpage just like that is probably not exactly the right
thing to do, since the correct element is already found and there is no need to
check whether tuples are matching after one step back. I'll try to avoid it in
the next version of the patch.
I was wondering about something else: don't we also have another problem with
updating this current index tuple by skipping before calling
btgettuple/_bt_next? I see there's some code in btgettuple to kill dead tuples
when scan->kill_prior_tuple is true. I'm not too familiar with the concept of
killing dead tuples while doing index scans, but by looking at the code it
seems to be possible that btgettuple returns a tuple, caller processes it and
sets kill_prior_tuple to true in order to have it killed. However, then the
skip scan kicks in, which sets the current tuple to a completely different
tuple. Then, on the next call of btgettuple, the wrong tuple gets killed. Is my
reasoning correct here or am I missing something?
Need to check, but probably we can avoid that by setting kill_prior_tuple to
false in the case of a skip scan, as in index_rescan.
On Sun, Jun 23, 2019 at 3:10 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
I've done some initial review on v20 - just reading through the code, no
tests at this point. Here are my comments:
Thank you!
2) indices.sgml
The new section is somewhat unclear and difficult to understand, I think
it'd deserve a rewording. Also, I wonder if we really should link to the
wiki page about FSM problems. We have a couple of wiki links in the sgml
docs, but those seem more generic, while this seems like a development page
that might disappear. But more importantly, that wiki page does not say
anything about "Loose Index scans" so is it even the right wiki page?
Wow, indeed, looks like it's a totally wrong reference. I think Kyotaro already
mentioned it too, so probably I'm going to remove it (and instead describe the
idea in a few words in the documentation itself).
6) nodeIndexScan.c
I wonder why we even add and initialize the ioss_ fields for IndexScan
nodes, when the skip scans require index-only scans?
Skip scans required index-only scans until recently, when the patch was updated
to incorporate the same approach for index scans too. My apologies, it looks
like the documentation and some comments are still inconsistent on this topic.
7) pathnode.c
I wonder how much was the costing discussed. It seems to me the logic is
fairly similar to ideas discussed in the incremental sort patch, and
we've been discussing some weak points there. I'm not sure how much we
need to consider those issues here.
Can you elaborate in a few words on which issues you mean? Is it about
the non-uniform distribution of distinct values? If so, I believe it's partially
addressed when we have to skip too often, by searching the next index page.
Although yeah, there is still an assumption of a uniform distribution of
distinct groups at planning time.
9) pathnodes.h
I don't understand what the uniq_distinct_pathkeys comment says :-(
Yeah, sorry, I'll try to improve the comments in the next version, where
I'm going to address all the feedback.
On Mon, Jun 24, 2019 at 01:44:14PM +0200, Dmitry Dolgov wrote:
On Sun, Jun 23, 2019 at 1:04 PM Floris Van Nee <florisvannee@optiver.com> wrote:
However, _bt_readpage messes things up, because it only reads tuples that
match all the provided keys (so where b=2)
Right, the problem you've reported first had similar origins. I'm starting to
think that probably using _bt_readpage just like that is not exactly right
thing to do, since the correct element is already found and there is no need to
check if tuples are matching after one step back. I'll try to avoid it in the
next version of the patch.
I was wondering about something else: don't we also have another problem with
updating this current index tuple by skipping before calling
btgettuple/_bt_next? I see there's some code in btgettuple to kill dead tuples
when scan->kill_prior_tuple is true. I'm not too familiar with the concept of
killing dead tuples while doing index scans, but by looking at the code it
seems to be possible that btgettuple returns a tuple, caller processes it and
sets kill_prior_tuple to true in order to have it killed. However, then the
skip scan kicks in, which sets the current tuple to a completely different
tuple. Then, on the next call of btgettuple, the wrong tuple gets killed. Is my
reasoning correct here or am I missing something?
Need to check, but probably we can avoid that by setting kill_prior_tuple to
false in the case of a skip scan, as in index_rescan.
On Sun, Jun 23, 2019 at 3:10 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
I've done some initial review on v20 - just reading through the code, no
tests at this point. Here are my comments:
Thank you!
2) indices.sgml
The new section is somewhat unclear and difficult to understand, I think
it'd deserve a rewording. Also, I wonder if we really should link to the
wiki page about FSM problems. We have a couple of wiki links in the sgml
docs, but those seem more generic while this seems as a development page
that might disappear. But more importantly, that wiki page does not say
anything about "Loose Index scans" so is it even the right wiki page?
Wow, indeed, looks like it's a totally wrong reference. I think Kyotaro already
mentioned it too, so probably I'm going to remove it (and instead describe the
idea in a few words in the documentation itself).
6) nodeIndexScan.c
I wonder why we even add and initialize the ioss_ fields for IndexScan
nodes, when the skip scans require index-only scans?
Skip scans required index-only scans until recently, when the patch was updated
to incorporate the same approach for index scans too. My apologies, looks like
documentation and some commentaries are still inconsistent about this topic.
Yes, if that's the case then various bits of docs and comments are rather
misleading, and fields in IndexScanState should be named 'iss_'.
7) pathnode.c
I wonder how much was the costing discussed. It seems to me the logic is
fairly similar to ideas discussed in the incremental sort patch, and
we've been discussing some weak points there. I'm not sure how much we
need to consider those issues here.
Can you please elaborate in a few words, which issues do you mean? Is it about
non uniform distribution of distinct values? If so, I believe it's partially
addressed when we have to skip too often, by searching a next index page.
Although yeah, there is still an assumption about uniform distribution of
distinct groups at the planning time.
Right, it's mostly about what happens when the group sizes are not close
to the average size. The question is what happens in such cases - how much
slower will the plan be, compared to the "current" plan without a skip scan?
I don't have a very good idea of the additional overhead associated with
skip scans - presumably it's a bit more expensive, right?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jun 21, 2019 at 1:20 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
Attached is v20, since the last patch should have been v19.
I took this for a quick spin today. The DISTINCT ON support is nice
and I think it will be very useful. I've signed up to review it and
will have more to say later. But today I had a couple of thoughts
after looking into how src/backend/optimizer/plan/planagg.c works and
wondering how to do some more skipping tricks with the existing
machinery.
1. SELECT COUNT(DISTINCT i) FROM t could benefit from this. (Or
AVG(DISTINCT ...) or any other aggregate). Right now you get a seq
scan, with the sort/unique logic inside the Aggregate node. If you
write SELECT COUNT(*) FROM (SELECT DISTINCT i FROM t) ss then you get
a skip scan that is much faster in good cases. I suppose you could
have a process_distinct_aggregates() in planagg.c that recognises
queries of the right form and generates extra paths a bit like
build_minmax_path() does. I think it's probably better to consider
that in the grouping planner proper instead. I'm not sure.
2. SELECT i, MIN(j) FROM t GROUP BY i could benefit from this if
you're allowed to go forwards. Same for SELECT i, MAX(j) FROM t GROUP
BY i if you're allowed to go backwards. Those queries are equivalent
to SELECT DISTINCT ON (i) i, j FROM t ORDER BY i [DESC], j [DESC]
(though as Floris noted, the backwards version gives the wrong answers
with v20). That does seem like a much more specific thing applicable
only to MIN and MAX, and I think preprocess_minmax_aggregates() could
be taught to handle that sort of query, building an index only scan
path with skip scan in build_minmax_path().
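To make point 1 concrete, here is a small sketch (plain Python, not planner or executor code) of why a skip scan composes naturally with a plain COUNT: the scan surfaces exactly one tuple per distinct key, so the Aggregate node no longer needs its own sort/unique step:

```python
def count_distinct_via_skip(index_keys_in_order):
    # Simulates a skip scan over an ordered index: runs of equal keys
    # are "skipped", so the consumer sees one row per distinct value
    # and a plain count of those rows equals COUNT(DISTINCT i).
    count = 0
    prev = object()  # sentinel that compares unequal to any real key
    for key in index_keys_in_order:
        if key != prev:
            count += 1
            prev = key
    return count
```

Note the caveat discussed later in the thread: this only works when nothing else in the query needs the rows that were skipped.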
--
Thomas Munro
https://enterprisedb.com
On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas.munro@gmail.com> wrote:
I took this for a quick spin today. The DISTINCT ON support is nice
and I think it will be very useful. I've signed up to review it and
will have more to say later. But today I had a couple of thoughts
after looking into how src/backend/optimizer/plan/planagg.c works and
wondering how to do some more skipping tricks with the existing
machinery.
1. SELECT COUNT(DISTINCT i) FROM t could benefit from this. (Or
AVG(DISTINCT ...) or any other aggregate). Right now you get a seq
scan, with the sort/unique logic inside the Aggregate node. If you
write SELECT COUNT(*) FROM (SELECT DISTINCT i FROM t) ss then you get
a skip scan that is much faster in good cases. I suppose you could
have a process_distinct_aggregates() in planagg.c that recognises
queries of the right form and generates extra paths a bit like
build_minmax_path() does. I think it's probably better to consider
that in the grouping planner proper instead. I'm not sure.
I think to make that happen we'd need to do a bit of an overhaul in
nodeAgg.c to allow it to make use of presorted results instead of
having the code blindly sort rows for each aggregate that has a
DISTINCT or ORDER BY. The planner would also then need to start
requesting paths with pathkeys that suit the aggregate and also
probably dictate the order the AggRefs should be evaluated to allow
all AggRefs to be evaluated that can be for each sort order. Once
that part is done then the aggregates could then also request paths
with certain "UniqueKeys" (a feature I mentioned in [1]/messages/by-id/CAKJS1f86FgODuUnHiQ25RKeuES4qTqeNxm1QbqJWrBoZxVGLiQ@mail.gmail.com), however we'd
need to be pretty careful with that one since there may be other parts
of the query that require that all rows are fed in, not just 1 row per
value of "i", e.g SELECT COUNT(DISTINCT i) FROM t WHERE z > 0; can't
just feed through 1 row for each "i" value, since we need only the
ones that have "z > 0". Getting the first part of this solved is much
more important than making skip scans work here, I'd say. I think we
need to be able to walk before we can run with DISTINCT / ORDER BY
aggs.
2. SELECT i, MIN(j) FROM t GROUP BY i could benefit from this if
you're allowed to go forwards. Same for SELECT i, MAX(j) FROM t GROUP
BY i if you're allowed to go backwards. Those queries are equivalent
to SELECT DISTINCT ON (i) i, j FROM t ORDER BY i [DESC], j [DESC]
(though as Floris noted, the backwards version gives the wrong answers
with v20). That does seem like a much more specific thing applicable
only to MIN and MAX, and I think preprocess_minmax_aggregates() could
be taught to handle that sort of query, building an index only scan
path with skip scan in build_minmax_path().
For the MIN query you just need a path with Pathkeys: { i ASC, j ASC
}, UniqueKeys: { i, j }, doing the MAX query you just need j DESC.
The more I think about these UniqueKeys, the more I think they need to
be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
should be equivalent to { y, x }, but with PathKeys, that's not the
case, since the order of each key matters. UniqueKeys equivalent
version of pathkeys_contained_in() would not care about the order of
individual keys, it would say things like, { a, b, c } is contained in
{ b, a }, since if the path is unique on columns { b, a } then it must
also be unique on { a, b, c }.
[1]: /messages/by-id/CAKJS1f86FgODuUnHiQ25RKeuES4qTqeNxm1QbqJWrBoZxVGLiQ@mail.gmail.com
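The order-insensitive containment David describes is essentially set inclusion. A minimal sketch (hypothetical names, not the actual proposed API):

```python
def uniquekeys_contained_in(wanted, path_unique_on):
    # "{a, b, c} is contained in {b, a}": if the path is unique on
    # {b, a}, it is necessarily unique on the superset {a, b, c}.
    # Unlike pathkeys_contained_in(), the order of keys is irrelevant,
    # so plain subset testing is enough.
    return set(path_unique_on) <= set(wanted)

# uniquekeys_contained_in({'a', 'b', 'c'}, {'b', 'a'})  -> True
# uniquekeys_contained_in({'a', 'b'}, {'a', 'b', 'c'})  -> False
```

This is why UniqueKeys want set semantics rather than the ordered-list semantics of PathKeys.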
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi Thomas and David,
Thanks for the feedback !
On 7/2/19 8:27 AM, David Rowley wrote:
On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas.munro@gmail.com> wrote:
I took this for a quick spin today. The DISTINCT ON support is nice
and I think it will be very useful. I've signed up to review it and
will have more to say later. But today I had a couple of thoughts
after looking into how src/backend/optimizer/plan/planagg.c works and
wondering how to do some more skipping tricks with the existing
machinery.1. SELECT COUNT(DISTINCT i) FROM t could benefit from this. (Or
AVG(DISTINCT ...) or any other aggregate). Right now you get a seq
scan, with the sort/unique logic inside the Aggregate node. If you
write SELECT COUNT(*) FROM (SELECT DISTINCT i FROM t) ss then you get
a skip scan that is much faster in good cases. I suppose you could
have a process_distinct_aggregates() in planagg.c that recognises
queries of the right form and generates extra paths a bit like
build_minmax_path() does. I think it's probably better to consider
that in the grouping planner proper instead. I'm not sure.
I think to make that happen we'd need to do a bit of an overhaul in
nodeAgg.c to allow it to make use of presorted results instead of
having the code blindly sort rows for each aggregate that has a
DISTINCT or ORDER BY. The planner would also then need to start
requesting paths with pathkeys that suit the aggregate and also
probably dictate the order the AggRefs should be evaluated to allow
all AggRefs to be evaluated that can be for each sort order. Once
that part is done then the aggregates could then also request paths
with certain "UniqueKeys" (a feature I mentioned in [1]), however we'd
need to be pretty careful with that one since there may be other parts
of the query that require that all rows are fed in, not just 1 row per
value of "i", e.g SELECT COUNT(DISTINCT i) FROM t WHERE z > 0; can't
just feed through 1 row for each "i" value, since we need only the
ones that have "z > 0". Getting the first part of this solved is much
more important than making skip scans work here, I'd say. I think we
need to be able to walk before we can run with DISTINCT / ORDER BY
aggs.
I agree that the above is outside of scope for the first patch -- I
think the goal should be the simple use-cases for IndexScan and
IndexOnlyScan.
Maybe we should expand [1]https://wiki.postgresql.org/wiki/Loose_indexscan with possible cases, so we don't lose track
of the ideas.
2. SELECT i, MIN(j) FROM t GROUP BY i could benefit from this if
you're allowed to go forwards. Same for SELECT i, MAX(j) FROM t GROUP
BY i if you're allowed to go backwards. Those queries are equivalent
to SELECT DISTINCT ON (i) i, j FROM t ORDER BY i [DESC], j [DESC]
(though as Floris noted, the backwards version gives the wrong answers
with v20). That does seem like a much more specific thing applicable
only to MIN and MAX, and I think preprocess_minmax_aggregates() could
be taught to handle that sort of query, building an index only scan
path with skip scan in build_minmax_path().For the MIN query you just need a path with Pathkeys: { i ASC, j ASC
}, UniqueKeys: { i, j }, doing the MAX query you just need j DESC.
Ok.
The more I think about these UniqueKeys, the more I think they need to
be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
should be equivalent to { y, x }, but with PathKeys, that's not the
case, since the order of each key matters. UniqueKeys equivalent
version of pathkeys_contained_in() would not care about the order of
individual keys, it would say things like, { a, b, c } is contained in
{ b, a }, since if the path is unique on columns { b, a } then it must
also be unique on { a, b, c }.
I'm looking at this, and will keep this in mind.
Thanks !
[1]: https://wiki.postgresql.org/wiki/Loose_indexscan
Best regards,
Jesper
On Wed, Jul 03, 2019 at 12:27:09AM +1200, David Rowley wrote:
On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas.munro@gmail.com> wrote:
The more I think about these UniqueKeys, the more I think they need to
be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
should be equivalent to { y, x }, but with PathKeys, that's not the
case, since the order of each key matters. UniqueKeys equivalent
version of pathkeys_contained_in() would not care about the order of
individual keys, it would say things like, { a, b, c } is contained in
{ b, a }, since if the path is unique on columns { b, a } then it must
also be unique on { a, b, c }.
Is that actually true, though? I can see unique {a, b, c} => unique
{a, b}, but for example:
a | b | c
--|---|--
1 | 2 | 3
1 | 2 | 4
1 | 2 | 5
is unique on {a, b, c} but not on {a, b}, at least as I understand the
way "unique" is used here, which is 3 distinct {a, b, c}, but only one
{a, b}.
Or I could be missing something obvious, and in that case, please
ignore.
Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Wed, Jul 3, 2019 at 3:46 PM David Fetter <david@fetter.org> wrote:
On Wed, Jul 03, 2019 at 12:27:09AM +1200, David Rowley wrote:
On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas.munro@gmail.com> wrote:
The more I think about these UniqueKeys, the more I think they need to
be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
should be equivalent to { y, x }, but with PathKeys, that's not the
case, since the order of each key matters. UniqueKeys equivalent
version of pathkeys_contained_in() would not care about the order of
individual keys, it would say things like, { a, b, c } is contained in
{ b, a }, since if the path is unique on columns { b, a } then it must
also be unique on { a, b, c }.
Is that actually true, though? I can see unique {a, b, c} => unique
{a, b}, but for example:
a | b | c
--|---|--
1 | 2 | 3
1 | 2 | 4
1 | 2 | 5
is unique on {a, b, c} but not on {a, b}, at least as I understand the
way "unique" is used here, which is 3 distinct {a, b, c}, but only one
{a, b}.
Or I could be missing something obvious, and in that case, please
ignore.
I think that example is the opposite direction of what David (Rowley)
is saying. Unique on {a, b} implies unique on {a, b, c} while you're
correct that the inverse doesn't hold.
Unique on {a, b} also implies unique on {b, a} as well as on {b, a, c}
and {c, a, b} and {c, b, a} and {a, c, b}, which is what makes this
different from pathkeys.
James Coleman
On Thu, 4 Jul 2019 at 09:02, James Coleman <jtc331@gmail.com> wrote:
On Wed, Jul 3, 2019 at 3:46 PM David Fetter <david@fetter.org> wrote:
On Wed, Jul 03, 2019 at 12:27:09AM +1200, David Rowley wrote:
On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas.munro@gmail.com> wrote:
The more I think about these UniqueKeys, the more I think they need to
be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
should be equivalent to { y, x }, but with PathKeys, that's not the
case, since the order of each key matters. UniqueKeys equivalent
version of pathkeys_contained_in() would not care about the order of
individual keys, it would say things like, { a, b, c } is contained in
{ b, a }, since if the path is unique on columns { b, a } then it must
also be unique on { a, b, c }.
Is that actually true, though? I can see unique {a, b, c} => unique
{a, b}, but for example:
a | b | c
--|---|--
1 | 2 | 3
1 | 2 | 4
1 | 2 | 5
is unique on {a, b, c} but not on {a, b}, at least as I understand the
way "unique" is used here, which is 3 distinct {a, b, c}, but only one
{a, b}.
Or I could be missing something obvious, and in that case, please
ignore.
I think that example is the opposite direction of what David (Rowley)
is saying. Unique on {a, b} implies unique on {a, b, c} while you're
correct that the inverse doesn't hold.
Unique on {a, b} also implies unique on {b, a} as well as on {b, a, c}
and {c, a, b} and {c, b, a} and {a, c, b}, which is what makes this
different from pathkeys.
Yeah, exactly. A superset of the unique columns is still unique.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jul 04, 2019 at 10:06:11AM +1200, David Rowley wrote:
On Thu, 4 Jul 2019 at 09:02, James Coleman <jtc331@gmail.com> wrote:
I think that example is the opposite direction of what David (Rowley)
is saying. Unique on {a, b} implies unique on {a, b, c} while you're
correct that the inverse doesn't hold.
Unique on {a, b} also implies unique on {b, a} as well as on {b, a, c}
and {c, a, b} and {c, b, a} and {a, c, b}, which is what makes this
different from pathkeys.
Yeah, exactly. A superset of the unique columns is still unique.
Thanks for clarifying!
Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Wed, Jul 3, 2019 at 12:27 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas.munro@gmail.com> wrote:
2. SELECT i, MIN(j) FROM t GROUP BY i could benefit from this if
you're allowed to go forwards. Same for SELECT i, MAX(j) FROM t GROUP
BY i if you're allowed to go backwards. Those queries are equivalent
to SELECT DISTINCT ON (i) i, j FROM t ORDER BY i [DESC], j [DESC]
(though as Floris noted, the backwards version gives the wrong answers
with v20). That does seem like a much more specific thing applicable
only to MIN and MAX, and I think preprocess_minmax_aggregates() could
be taught to handle that sort of query, building an index only scan
path with skip scan in build_minmax_path().
For the MIN query you just need a path with Pathkeys: { i ASC, j ASC
}, UniqueKeys: { i, j }, doing the MAX query you just need j DESC.
While updating the Loose Index Scan wiki page with links to other
products' terminology on this subject, I noticed that MySQL can
skip-scan MIN() and MAX() in the same query. Hmm. That seems quite
desirable. I think it requires a new kind of skipping: I think you
have to be able to skip to the first AND last key that has each
distinct prefix, and then stick a regular agg on top to collapse them
into one row. Such a path would not be so neatly describable by
UniqueKeys, or indeed by the amskip() interface in the current patch.
I mention all this stuff not because I want us to run before we can
walk, but because to be ready to commit the basic distinct skip scan
feature, I think we should know approximately how it'll handle the
future stuff we'll need.
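A sketch of that "skip to the first AND last key per distinct prefix" behaviour, modelled over an already-sorted (i, j) index (illustration only; this is nothing like the amskip() interface, and it materializes each group rather than truly skipping):

```python
from itertools import groupby

def min_and_max_per_prefix(sorted_index_entries):
    # With an index ordered on (i, j), a scan that can jump to both the
    # first and the last entry of each distinct prefix i would emit two
    # tuples per group; a regular Agg on top collapses them into
    # MIN(j) and MAX(j). Here we just model the emitted values.
    result = {}
    for i, group in groupby(sorted_index_entries, key=lambda e: e[0]):
        js = [j for _, j in group]
        result[i] = (js[0], js[-1])  # first = MIN(j), last = MAX(j)
    return result
```

The point is that such a path produces two rows per prefix, which is why it isn't neatly describable by UniqueKeys as currently envisioned.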
--
Thomas Munro
https://enterprisedb.com
On Sat, Jun 22, 2019 at 12:17 PM Floris Van Nee <florisvannee@optiver.com> wrote:
The following sql statement seems to have incorrect results - some logic in
the backwards scan is currently not entirely right.
Thanks for testing! You're right, it looks like the current implementation
takes one unnecessary extra step forward in the case of a backwards scan. This
mistake was made because I was concentrating only on backward scans with a
cursor, and used a not entirely correct approach to wrap up after a scan was
finished. Give me a moment, I'll tighten it up.
Here, finally, is a new version of the patch, in which all the mentioned
issues seem to be fixed, and the corresponding new tests should keep it that
way (I've skipped all the pubs at PostgresLondon for that). I've also
addressed most of the feedback from Tomas, except the points about planning
improvements (which are still on our todo list). By no means is this a final
result (e.g. I guess `_bt_read_closest` must be improved), but I hope making
progress step-by-step will help anyway. I've also fixed some, as it is popular
to say here, brain fade, where I mixed up scan directions.
On Tue, Jul 2, 2019 at 11:00 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Fri, Jun 21, 2019 at 1:20 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
Attached is v20, since the last patch should have been v19.
I took this for a quick spin today. The DISTINCT ON support is nice
and I think it will be very useful. I've signed up to review it and
will have more to say later. But today I had a couple of thoughts
after looking into how src/backend/optimizer/plan/planagg.c works and
wondering how to do some more skipping tricks with the existing
machinery.
On Thu, Jul 4, 2019 at 1:00 PM Thomas Munro <thomas.munro@gmail.com> wrote:
I mention all this stuff not because I want us to run before we can
walk, but because to be ready to commit the basic distinct skip scan
feature, I think we should know approximately how it'll handle the
future stuff we'll need.
Great, thank you! I agree with Jesper that some parts of this are probably
outside the scope of the first patch, but we can definitely take a look at
what needs to be done to make the current implementation more flexible, so
that a follow-up would be natural.
Attachments:
v21-0001-Index-skip-scan.patchapplication/octet-stream; name=v21-0001-Index-skip-scan.patchDownload
From fcda76e179de5cb4ec7e9309c96887fe097bc93d Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Wed, 3 Jul 2019 16:25:20 +0200
Subject: [PATCH v21] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan and IndexScan. To make it suitable both for
situations with a small number of distinct values and for those with a
significant number of them, the following approach is taken: instead of
searching from the root for every value, we search first on the current
page, and only if the value is not found there do we continue searching
from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 15 +
doc/src/sgml/indexam.sgml | 10 +
doc/src/sgml/indices.sgml | 24 ++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 +
src/backend/access/nbtree/nbtree.c | 12 +
src/backend/access/nbtree/nbtsearch.c | 451 +++++++++++++++++++++++++-
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 29 ++
src/backend/executor/nodeIndexonlyscan.c | 22 ++
src/backend/executor/nodeIndexscan.c | 22 ++
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 3 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/path/pathkeys.c | 84 ++++-
src/backend/optimizer/plan/createplan.c | 20 +-
src/backend/optimizer/plan/planagg.c | 1 +
src/backend/optimizer/plan/planner.c | 79 ++++-
src/backend/optimizer/util/pathnode.c | 40 +++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 5 +
src/include/access/genam.h | 1 +
src/include/access/nbtree.h | 5 +
src/include/nodes/execnodes.h | 6 +
src/include/nodes/pathnodes.h | 10 +
src/include/nodes/plannodes.h | 2 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/include/optimizer/paths.h | 4 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 250 ++++++++++++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 87 +++++
41 files changed, 1210 insertions(+), 22 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index ee3bd56274..a88b730f2e 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..9644b9f8cb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4400,6 +4400,21 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). The default is
+ <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index dd54c68802..c2eb296306 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -144,6 +144,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -687,6 +688,15 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan.
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 95c0a1926c..567141046f 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1235,6 +1235,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When an index scan is used to retrieve the distinct values of a column,
+ it can be inefficient, since it still has to step over all the equal
+ values of a key. In such cases the planner will consider applying an
+ index skip scan, which is based on the idea of a
+ <firstterm>Loose index scan</firstterm>. Rather than scanning all equal
+ values of a key, as soon as a new value is found it will search for a
+ larger value on the same index page, and if none is found, restart the
+ search by descending from the root of the index. This is much faster
+ when the index has many equal keys.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ae7b729edd..233ea9e5ec 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -109,6 +109,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cf9699ad18..9817f34c34 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 470b121e7d..328c17f13a 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 5cc30dac42..019e330cff 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -82,6 +82,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index aefdd2916d..1c2def162c 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -730,6 +731,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 85e54ac44b..3e50abd6b0 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -131,6 +131,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -380,6 +381,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -447,6 +450,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c655dadb96..1c32a24cc1 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -28,6 +28,8 @@ static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
+static bool _bt_read_closest(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
@@ -37,7 +39,10 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
+static inline void _bt_update_skip_scankeys(IndexScanDesc scan,
+ Relation indexRel);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir);
/*
* _bt_drop_lock_and_maybe_pin()
@@ -1380,6 +1385,139 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple.
+ *
+ * The current position is set so that a subsequent call to _bt_next will
+ * fetch the first tuple that differs in the leading 'prefix' keys.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Relation indexRel = scan->indexRelation;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ so->skipScanKey->keysz = prefix;
+ so->skipScanKey->scantid = NULL;
+ }
+ else
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos) &&
+ _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {
+ bool keyFound = false;
+
+ LockBuffer(so->currPos.buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, so->currPos.buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(so->currPos.buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_read_closest(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found scan key within the current page, so let's scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+ * number
+ */
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+
+ /* Now read the data */
+ if (!_bt_read_closest(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
@@ -1596,6 +1734,268 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
return (so->currPos.firstItem <= so->currPos.lastItem);
}
+/*
+ * _bt_read_closest() -- Load data from the closest two items, previous and
+ * current, on the current index page into so->currPos
+ *
+ * Similar to _bt_readpage, except that it reads only the current and the
+ * previous item. So far it is only used by _bt_skip.
+ *
+ * Returns true if the two required matching items were found on the page,
+ * false otherwise.
+ */
+static bool
+_bt_read_closest(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ int itemIndex;
+ bool continuescan;
+ int indnatts;
+
+ /*
+ * We must have the buffer pinned and locked, but the usual macro can't be
+ * used here; this function is what makes it good for currPos.
+ */
+ Assert(BufferIsValid(so->currPos.buf));
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ /* allow next page be processed by parallel worker */
+ if (scan->parallel_scan)
+ {
+ if (ScanDirectionIsForward(dir))
+ _bt_parallel_release(scan, opaque->btpo_next);
+ else
+ _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ }
+
+ continuescan = true; /* default assumption */
+ indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * We note the buffer's block number so that we can release the pin later.
+ * This allows us to re-read the buffer if it is needed again for hinting.
+ */
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it is
+ * safe to apply LP_DEAD hints to the page later. This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+
+ /*
+ * we must save the page's right-link while scanning it; this tells us
+ * where to step right to after we're done with these items. There is no
+ * corresponding need for the left-link, since splits always go right.
+ */
+ so->currPos.nextPage = opaque->btpo_next;
+
+ /* initialize tuple workspace to empty */
+ so->currPos.nextTupleOffset = 0;
+
+ /*
+ * Now that the current page has been made consistent, the macro should be
+ * good.
+ */
+ Assert(BTScanPosIsPinned(so->currPos));
+
+ if (ScanDirectionIsForward(dir))
+ {
+ IndexTuple prevItup = NULL;
+ OffsetNumber prevOffNum;
+
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ offnum = Max(offnum, minoff);
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ offnum = OffsetNumberNext(offnum);
+ continue;
+ }
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
+ {
+ /* tuple passes all scan key conditions, so remember it */
+
+ if (prevItup == NULL)
+ {
+ _bt_saveitem(so, 0, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ _bt_saveitem(so, 0, prevOffNum, prevItup);
+ itemIndex++;
+
+ _bt_saveitem(so, 1, offnum, itup);
+ itemIndex++;
+
+ Assert(itemIndex <= MaxIndexTuplesPerPage);
+ so->currPos.firstItem = 0;
+ so->currPos.itemIndex = 0;
+ so->currPos.lastItem = 1;
+
+ /*
+ * All of the closest items were found, so we can report
+ * success
+ */
+ return true;
+ }
+ }
+ /* When !continuescan, there can't be any more matches, so stop */
+ if (!continuescan)
+ break;
+
+ prevOffNum = offnum;
+ prevItup = itup;
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /*
+ * We don't need to visit page to the right when the high key
+ * indicates that no more matches will be found there.
+ *
+ * Checking the high key like this works out more often than you might
+ * think. Leaf page splits pick a split point between the two most
+ * dissimilar tuples (this is weighed against the need to evenly share
+ * free space). Leaf pages with high key attribute values that can
+ * only appear on non-pivot tuples on the right sibling page are
+ * common.
+ */
+ if (continuescan && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
+ int truncatt;
+
+ truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
+ _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+ }
+
+ if (!continuescan)
+ so->currPos.moreRight = false;
+
+ Assert(itemIndex <= MaxIndexTuplesPerPage);
+ so->currPos.firstItem = 0;
+ so->currPos.lastItem = itemIndex - 1;
+ so->currPos.itemIndex = 0;
+ }
+ else
+ {
+ IndexTuple prevItup = NULL;
+ OffsetNumber prevOffNum;
+
+ /* load items[] in descending order */
+ itemIndex = MaxIndexTuplesPerPage;
+
+ offnum = Min(offnum, maxoff);
+
+ while (offnum >= minoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+ bool tuple_alive;
+ bool passes_quals;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual. Most of the
+ * time, it's a win to not bother examining the tuple's index
+ * keys, but just skip to the next tuple (previous, actually,
+ * since we're scanning backwards). However, if this is the first
+ * tuple on the page, we do check the index keys, to prevent
+ * uselessly advancing to the page to the left. This is similar
+ * to the high key optimization used by forward scans.
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ Assert(offnum >= P_FIRSTDATAKEY(opaque));
+ if (offnum > P_FIRSTDATAKEY(opaque))
+ {
+ offnum = OffsetNumberPrev(offnum);
+ continue;
+ }
+
+ tuple_alive = false;
+ }
+ else
+ tuple_alive = true;
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
+ &continuescan);
+ if (passes_quals && tuple_alive)
+ {
+ /* tuple passes all scan key conditions, so remember it */
+ if (prevItup == NULL)
+ {
+ _bt_saveitem(so, MaxIndexTuplesPerPage - 1, offnum, itup);
+ so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+ }
+ else
+ {
+ _bt_saveitem(so, MaxIndexTuplesPerPage - 1, prevOffNum, prevItup);
+ _bt_saveitem(so, MaxIndexTuplesPerPage - 2, offnum, itup);
+
+ Assert(itemIndex <= MaxIndexTuplesPerPage);
+ so->currPos.firstItem = MaxIndexTuplesPerPage - 2;
+ so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+
+ /*
+ * All of the closest items were found, so we can report
+ * success
+ */
+ return true;
+ }
+
+ itemIndex--;
+ }
+ if (!continuescan)
+ {
+ /* there can't be any more matches, so stop */
+ so->currPos.moreLeft = false;
+ break;
+ }
+
+ prevOffNum = offnum;
+ prevItup = itup;
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ Assert(itemIndex >= 0);
+ so->currPos.firstItem = itemIndex;
+ so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ }
+
+ /* Not all of the closest items were found */
+ return false;
+}
+
/* Save an index item into so->currPos.items[itemIndex] */
static void
_bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -2249,3 +2649,52 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
so->numKilled = 0; /* just paranoia */
so->markItemIndex = -1; /* ditto */
}
+
+/*
+ * _bt_update_skip_scankeys() -- set up new values for the existing scankeys
+ * based on the current index tuple
+ */
+static inline void
+_bt_update_skip_scankeys(IndexScanDesc scan, Relation indexRel)
+{
+ TupleDesc itupdesc;
+ int indnkeyatts, i;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey scankeys = so->skipScanKey->scankeys;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+}
+
+/*
+ * _bt_scankey_within_page() -- check if the provided scankey could be found
+ * within the page held in the given buffer.
+ */
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir)
+{
+ OffsetNumber low, high, compare_offset;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ return _bt_compare(scan->indexRelation,
+ key, page, compare_offset) > compare_value;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 45472db147..dc151ecf09 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 92969636b7..1010280c71 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -130,6 +130,7 @@ static void ExplainDummyGroup(const char *objtype, const char *labelname,
static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
static void ExplainJSONLineEnding(ExplainState *es);
static void ExplainYAMLLineStarting(ExplainState *es);
+static void ExplainIndexSkipScanKeys(ExplainState *es, int skipPrefixSize);
static void escape_yaml(StringInfo buf, const char *str);
@@ -1041,6 +1042,22 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
return planstate_tree_walker(planstate, ExplainPreScanNode, rels_used);
}
+/*
+ * ExplainIndexSkipScanKeys -
+ * Append information about index skip scan to es->str.
+ *
+ * Can be used to print the skip prefix size.
+ */
+static void
+ExplainIndexSkipScanKeys(ExplainState *es, int skipPrefixSize)
+{
+ if (skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL, skipPrefixSize, es);
+ }
+}
+
/*
* ExplainNode -
* Appends a description of a plan tree to es->str
@@ -1363,6 +1380,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexScan *indexscan = (IndexScan *) plan;
+ ExplainIndexSkipScanKeys(es, indexscan->indexskipprefixsize);
+
ExplainIndexScanDetails(indexscan->indexid,
indexscan->indexorderdir,
es);
@@ -1373,6 +1392,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ ExplainIndexSkipScanKeys(es, indexonlyscan->indexskipprefixsize);
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1582,6 +1603,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
switch (nodeTag(plan))
{
case T_IndexScan:
+ if (((IndexScan *) plan)->indexskipprefixsize > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
if (((IndexScan *) plan)->indexqualorig)
@@ -1595,6 +1620,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->indexskipprefixsize > 0)
+ {
+ ExplainPropertyText("Scan mode", "Skip scan", es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 8a4d795d1a..12772cfb63 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -115,6 +115,24 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_SkipPrefixSize > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_SkipPrefixSize))
+ {
+ /*
+ * Reached end of index. At this point currPos is invalidated, and
+ * we need to reset ioss_FirstTupleEmitted, since otherwise after
+ * going backwards, reaching the end of index, and going forward
+ * again we would apply the skip again, which would be incorrect
+ * and lead to an extra skipped item.
+ */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -253,6 +271,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -503,6 +523,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index ac7aa81f67..6a256e5925 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -116,6 +116,7 @@ IndexNext(IndexScanState *node)
node->iss_NumOrderByKeys);
node->iss_ScanDesc = scandesc;
+ node->iss_ScanDesc->xs_want_itup = true;
/*
* If no run-time keys to calculate or they are ready, go ahead and
@@ -127,6 +128,24 @@ IndexNext(IndexScanState *node)
node->iss_OrderByKeys, node->iss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->iss_SkipPrefixSize > 0 && node->iss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->iss_SkipPrefixSize))
+ {
+ /*
+ * Reached end of index. At this point currPos is invalidated, and
+ * we need to reset iss_FirstTupleEmitted, since otherwise after
+ * going backwards, reaching the end of index, and going forward
+ * again we would apply the skip again, which would be incorrect
+ * and lead to an extra skipped item.
+ */
+ node->iss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* ok, now that we have what we need, fetch the next tuple.
*/
@@ -149,6 +168,7 @@ IndexNext(IndexScanState *node)
}
}
+ node->iss_FirstTupleEmitted = true;
return slot;
}
@@ -906,6 +926,8 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexScan;
+ indexstate->iss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->iss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
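To make the executor-side contract in the two hunks above concrete: once the first tuple has been emitted, every subsequent fetch first skips past the remaining tuples that share the current key prefix. A toy model of that index_skip() semantics over a sorted array follows; `Row` and `skip_to_next_prefix` are hypothetical names for this sketch, not the patch's code:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model of the skip contract: after the first tuple is emitted,
 * each further fetch advances past all rows sharing the current prefix.
 */
typedef struct Row
{
	int			prefix;			/* leading key column(s), collapsed to one */
	int			value;			/* trailing payload */
} Row;

/*
 * Advance *pos to the first row whose prefix differs from rows[*pos].
 * Returns false when the end of the "index" is reached.
 */
static bool
skip_to_next_prefix(const Row *rows, size_t nrows, size_t *pos)
{
	int			cur = rows[*pos].prefix;

	while (++*pos < nrows)
	{
		if (rows[*pos].prefix != cur)
			return true;
	}
	return false;
}
```

The FirstTupleEmitted reset on reaching the end of the index corresponds to *pos running off the array here: a subsequent forward fetch has to start from scratch rather than skip once more.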
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 78deade89b..8bb0b3eaee 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -490,6 +490,7 @@ _copyIndexScan(const IndexScan *from)
COPY_NODE_FIELD(indexorderbyorig);
COPY_NODE_FIELD(indexorderbyops);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
@@ -515,6 +516,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 8400dd319e..44286a86e8 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -559,6 +559,7 @@ _outIndexScan(StringInfo str, const IndexScan *node)
WRITE_NODE_FIELD(indexorderbyorig);
WRITE_NODE_FIELD(indexorderbyops);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
@@ -573,6 +574,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
@@ -2208,6 +2210,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
WRITE_NODE_FIELD(group_pathkeys);
WRITE_NODE_FIELD(window_pathkeys);
WRITE_NODE_FIELD(distinct_pathkeys);
+ WRITE_NODE_FIELD(uniq_distinct_pathkeys);
WRITE_NODE_FIELD(sort_pathkeys);
WRITE_NODE_FIELD(processed_tlist);
WRITE_NODE_FIELD(minmax_aggs);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 6c2626ee62..45354a0b95 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1787,6 +1787,7 @@ _readIndexScan(void)
READ_NODE_FIELD(indexorderbyorig);
READ_NODE_FIELD(indexorderbyops);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
@@ -1806,6 +1807,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a2a9b1f7be..6e0fe90e5c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 08b5061612..af7d9c4270 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
#include "utils/lsyscache.h"
+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
RelOptInfo *partrel,
@@ -94,6 +95,30 @@ make_canonical_pathkey(PlannerInfo *root,
return pk;
}
+/*
+ * pathkey_is_unique
+ * Part of the pathkey_is_redundant test: returns true if the new
+ * pathkey's equivalence class is the same as that of any existing
+ * member of the pathkey list.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+ EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+ ListCell *lc;
+
+ /* If same EC already used in list, then redundant */
+ foreach(lc, pathkeys)
+ {
+ PathKey *old_pathkey = (PathKey *) lfirst(lc);
+
+ if (new_ec == old_pathkey->pk_eclass)
+ return true;
+ }
+
+ return false;
+}
+
/*
* pathkey_is_redundant
* Is a pathkey redundant with one already in the given list?
@@ -133,22 +158,12 @@ static bool
pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys)
{
EquivalenceClass *new_ec = new_pathkey->pk_eclass;
- ListCell *lc;
/* Check for EC containing a constant --- unconditionally redundant */
if (EC_MUST_BE_REDUNDANT(new_ec))
return true;
- /* If same EC already used in list, then redundant */
- foreach(lc, pathkeys)
- {
- PathKey *old_pathkey = (PathKey *) lfirst(lc);
-
- if (new_ec == old_pathkey->pk_eclass)
- return true;
- }
-
- return false;
+ return pathkey_is_unique(new_pathkey, pathkeys);
}
/*
@@ -1096,6 +1111,53 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
return pathkeys;
}
+/*
+ * make_pathkeys_for_distinctclauses
+ * Generate a pathkeys list for distinct clauses that represents the sort
+ * order specified by a list of SortGroupClauses. Similar to
+ * make_pathkeys_for_sortclauses, but allows specifying whether to check
+ * for full redundancy or only for uniqueness.
+ */
+List *
+make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *distinctclauses,
+ List *tlist, bool checkRedundant)
+{
+ List *pathkeys = NIL;
+ ListCell *l;
+
+ foreach(l, distinctclauses)
+ {
+ SortGroupClause *sortcl = (SortGroupClause *) lfirst(l);
+ Expr *sortkey;
+ PathKey *pathkey;
+
+ sortkey = (Expr *) get_sortgroupclause_expr(sortcl, tlist);
+ Assert(OidIsValid(sortcl->sortop));
+ pathkey = make_pathkey_from_sortop(root,
+ sortkey,
+ root->nullable_baserels,
+ sortcl->sortop,
+ sortcl->nulls_first,
+ sortcl->tleSortGroupRef,
+ true);
+
+ /* Canonical form eliminates redundant ordering keys */
+ if (checkRedundant)
+ {
+ if (!pathkey_is_redundant(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ else
+ {
+ if (!pathkey_is_unique(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ }
+ return pathkeys;
+}
+
+
/****************************************************************************
* PATHKEYS AND MERGECLAUSES
****************************************************************************/
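On the pathkey_is_unique refactoring above: the extracted helper is nothing more than a pointer-equality membership test over the equivalence classes already collected into the pathkey list. The same shape, sketched standalone (`ec_already_used` is a made-up name; the real code walks a List of PathKey nodes rather than a bare pointer array):

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Standalone sketch of the pathkey_is_unique check: a pathkey is a
 * duplicate if its equivalence class pointer already appears among the
 * equivalence classes of the pathkeys collected so far.
 */
static bool
ec_already_used(const void *new_ec, const void *const *ecs, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		if (ecs[i] == new_ec)	/* same EC pointer => redundant pathkey */
			return true;
	}
	return false;
}
```

Pointer comparison suffices because canonical equivalence classes are interned by the planner, which is also why the original loop in pathkey_is_redundant could compare pk_eclass pointers directly.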
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 608d5adfed..e4acdec0e0 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -175,12 +175,14 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
Oid indexid, List *indexqual, List *indexqualorig,
List *indexorderby, List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2903,7 +2905,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -2914,7 +2917,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexorderbys,
indexorderbys,
indexorderbyops,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
copy_generic_path_info(&scan_plan->plan, &best_path->path);
@@ -5150,7 +5154,8 @@ make_indexscan(List *qptlist,
List *indexorderby,
List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexScan *node = makeNode(IndexScan);
Plan *plan = &node->scan.plan;
@@ -5167,6 +5172,7 @@ make_indexscan(List *qptlist,
node->indexorderbyorig = indexorderbyorig;
node->indexorderbyops = indexorderbyops;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
@@ -5179,7 +5185,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5194,6 +5201,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 9381939c82..ed52139839 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -505,6 +505,7 @@ minmax_qp_callback(PlannerInfo *root, void *extra)
root->group_pathkeys = NIL;
root->window_pathkeys = NIL;
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index cb897cc7f4..663be21597 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3615,12 +3615,21 @@ standard_qp_callback(PlannerInfo *root, void *extra)
if (parse->distinctClause &&
grouping_is_sortable(parse->distinctClause))
+ {
+ root->uniq_distinct_pathkeys =
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, false);
root->distinct_pathkeys =
- make_pathkeys_for_sortclauses(root,
- parse->distinctClause,
- tlist);
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, true);
+ }
else
+ {
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
+ }
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
@@ -4807,6 +4816,70 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Consider index skip scan as well */
+ if (enable_indexskipscan &&
+ IsA(path, IndexPath) &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ (path->pathtype == T_IndexOnlyScan ||
+ path->pathtype == T_IndexScan) &&
+ root->distinct_pathkeys != NIL)
+ {
+ ListCell *lc;
+ IndexOptInfo *index = NULL;
+ bool different_columns_order = false,
+ not_empty_qual = false;
+ int i = 0;
+
+ index = ((IndexPath *) path)->indexinfo;
+
+ /*
+ * The order of columns in the index must be the same as in the
+ * unique distinct pathkeys; otherwise we cannot use _bt_search
+ * in the skip implementation, which could lead to missing
+ * records.
+ */
+ foreach(lc, root->uniq_distinct_pathkeys)
+ {
+ PathKey *pathKey = lfirst_node(PathKey, lc);
+ EquivalenceMember *em =
+ lfirst_node(EquivalenceMember,
+ list_head(pathKey->pk_eclass->ec_members));
+ Var *var = (Var *) em->em_expr;
+
+ Assert(i < index->ncolumns);
+
+ if (index->indexkeys[i] != var->varattno)
+ {
+ different_columns_order = true;
+ break;
+ }
+
+ i++;
+ }
+
+ /*
+ * XXX: In the case of an index scan, quals evaluation happens after
+ * ExecScanFetch, which means skip results could be filtered out
+ */
+ if (path->pathtype == T_IndexScan &&
+ parse->jointree != NULL &&
+ parse->jointree->quals != NULL &&
+ list_length((List *) parse->jointree->quals) != 0)
+ not_empty_qual = true;
+
+ if (!different_columns_order && !not_empty_qual)
+ {
+ int distinctPrefixKeys =
+ list_length(root->uniq_distinct_pathkeys);
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ distinctPrefixKeys,
+ numDistinctRows));
+ }
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d884d2bb00..df9b57215f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2928,6 +2928,46 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *basepath,
+ int distinctPrefixKeys,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(distinctPrefixKeys > 0);
+ /*Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));*/
+ pathnode->indexskipprefix = distinctPrefixKeys;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = basepath->startup_cost;
+ pathnode->path.total_cost = basepath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
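The costing in create_skipscan_unique_path above amounts to: each expected distinct group is charged roughly one index descent, i.e. the base path's startup cost. Checking that arithmetic against the EXPLAIN output quoted at the top of the thread (startup 0.43, 3 distinct values, total shown as 0.43..1.30); `skipscan_total_cost` is a made-up helper name for this sketch, not the patch's code:

```c
/*
 * Sketch of the skip scan cost model: total cost is the startup cost
 * (one index descent) multiplied by the expected number of distinct
 * groups. With startup 0.43 and 3 groups this gives about 1.29, which
 * matches the 0.43..1.30 estimate in the EXPLAIN output.
 */
static double
skipscan_total_cost(double startup_cost, double num_groups)
{
	return startup_cost * num_groups;
}
```

This model obviously ignores the cheaper within-page skip path (_bt_scankey_within_page), so it is pessimistic when consecutive distinct values land on the same leaf page.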
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 40f497660d..8c05b3bb5c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -269,6 +269,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1208eb9a68..007c8ac14e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -912,6 +912,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5ee5e09ddf..99facc8f50 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6e3db06eed..34033c5486 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -130,6 +130,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir, int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -225,6 +229,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 8c053be2ca..2e79098b85 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -173,6 +173,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a3583f225b..247cdb8127 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -663,6 +663,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTScanInsert skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -777,6 +780,7 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -801,6 +805,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 99b9fa414f..df82c5d6dd 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1377,6 +1377,8 @@ typedef struct IndexScanState
ExprContext *iss_RuntimeContext;
Relation iss_RelationDesc;
struct IndexScanDescData *iss_ScanDesc;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
/* These are needed for re-checking ORDER BY expr ordering */
pairingheap *iss_ReorderQueue;
@@ -1406,6 +1408,8 @@ typedef struct IndexScanState
* TableSlot slot for holding tuples fetched from the table
* VMBuffer buffer in use for visibility map testing, if any
* PscanLen size of parallel index-only scan descriptor
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ----------------
*/
typedef struct IndexOnlyScanState
@@ -1424,6 +1428,8 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
TupleTableSlot *ioss_TableSlot;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 441e64eca9..eeff4a2935 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -298,6 +298,11 @@ struct PlannerInfo
List *group_pathkeys; /* groupClause pathkeys, if any */
List *window_pathkeys; /* pathkeys of bottom window, if any */
List *distinct_pathkeys; /* distinctClause pathkeys, if any */
+ List *uniq_distinct_pathkeys; /* unique, but potentially redundant
+ distinctClause pathkeys, if any.
+ Used for index skip scan, since
+ redundant distinctClauses also must
+ be considered */
List *sort_pathkeys; /* sortClause pathkeys, if any */
List *part_schemes; /* Canonicalised partition schemes used in the
@@ -829,6 +834,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1165,6 +1171,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1177,6 +1186,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 70f8b8e22b..72b4681613 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -405,6 +405,7 @@ typedef struct IndexScan
List *indexorderbyorig; /* the same in original form */
List *indexorderbyops; /* OIDs of sort ops for ORDER BY exprs */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct scans */
} IndexScan;
/* ----------------
@@ -432,6 +433,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9b6bdbc518..ad28c7f54a 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e70d6a3f18..fa461201a7 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -202,6 +202,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137d1d..a782d12a50 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -209,6 +209,10 @@ extern List *build_join_pathkeys(PlannerInfo *root,
extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
List *sortclauses,
List *tlist);
+extern List *make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist,
+ bool checkRedundant);
extern void initialize_mergeclause_eclasses(PlannerInfo *root,
RestrictInfo *restrictinfo);
extern void update_mergeclause_eclasses(PlannerInfo *root,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 5305b53cac..056de928fe 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..f7b9120539 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,253 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+SELECT DISTINCT four FROM tenk1 ORDER BY four DESC;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, hundred, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 100) hundred
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+ a | b
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+ a | b
+---+-----
+ 5 | 100
+ 4 | 100
+ 3 | 100
+ 2 | 100
+ 1 | 100
+(5 rows)
+
+DROP TABLE distinct_a;
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ four | ten
+------+-----
+ 0 | 0
+ 1 | 9
+ 2 | 0
+ 3 | 1
+(4 rows)
+
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ four | ten
+------+-----
+ 1 | 9
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ QUERY PLAN
+--------------------------------------
+ Index Scan using tenk1_four on tenk1
+ Scan mode: true
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ QUERY PLAN
+---------------------------------------------------
+ Result
+ -> Unique
+ -> Bitmap Heap Scan on tenk1
+ Recheck Cond: (four = 1)
+ -> Bitmap Index Scan on tenk1_four
+ Index Cond: (four = 1)
+(6 rows)
+
+-- check columns order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ four | ten
+------+-----
+ 0 | 0
+ 0 | 2
+ 0 | 4
+ 0 | 6
+ 0 | 8
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Scan mode: true
+ Index Cond: (four = 0)
+(3 rows)
+
+DROP INDEX tenk1_four_ten;
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ four | ten
+------+-----
+ 0 | 2
+ 2 | 2
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------------
+ Unique
+ -> Index Only Scan using tenk1_ten_four on tenk1
+ Index Cond: (ten = 2)
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------------
+ Unique
+ -> Index Only Scan using tenk1_ten_four on tenk1
+ Index Cond: (ten = 2)
+(3 rows)
+
+DROP INDEX tenk1_ten_four;
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+ four | four
+------+------
+ 0 | 0
+ 2 | 2
+(2 rows)
+
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+ four | ?column?
+------+----------
+ 2 | 1
+ 0 | 1
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+-------------------------------------------
+ Index Only Scan using tenk1_four on tenk1
+ Scan mode: true
+(2 rows)
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+FETCH FROM c;
+ four
+------
+ 0
+(1 row)
+
+FETCH BACKWARD FROM c;
+ four
+------
+(0 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+END;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index cd46f071bd..04760639a8 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..5fec91dbc2 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,90 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+SELECT DISTINCT four FROM tenk1 ORDER BY four DESC;
+
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, hundred, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 100) hundred
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+
+DROP TABLE distinct_a;
+
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+-- check columns order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+DROP INDEX tenk1_four_ten;
+
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+DROP INDEX tenk1_ten_four;
+
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+END;
--
2.16.4
Hi,
On 7/4/19 6:59 AM, Thomas Munro wrote:
For the MIN query you just need a path with Pathkeys: { i ASC, j ASC
}, UniqueKeys: { i, j }, doing the MAX query you just need j DESC.
David, are you thinking about something like the attached ?
Some questions.
* Do you see UniqueKey as a "complete" planner node ?
- I didn't update the nodes/*.c files for this yet
* Is a UniqueKey with a list of EquivalenceClass best, or a list of
UniqueKey with a single EquivalenceClass
Likely more questions around this coming -- should this be a separate
thread ?
Based on this I'll start to update the v21 patch to use UniqueKey, and
post a new version.
While updating the Loose Index Scan wiki page with links to other
products' terminology on this subject, I noticed that MySQL can
skip-scan MIN() and MAX() in the same query. Hmm. That seems quite
desirable. I think it requires a new kind of skipping: I think you
have to be able to skip to the first AND last key that has each
distinct prefix, and then stick a regular agg on top to collapse them
into one row. Such a path would not be so neatly describable by
UniqueKeys, or indeed by the amskip() interface in the current patch.
I mention all this stuff not because I want us to run before we can
walk, but because to be ready to commit the basic distinct skip scan
feature, I think we should know approximately how it'll handle the
future stuff we'll need.
Thomas, do you have any ideas for this ? I can see that MySQL implemented the
functionality in two change sets (base and function support), but like
you said we shouldn't paint ourselves into a corner.
Feedback greatly appreciated.
Best regards,
Jesper
Attachments:
uniquekey.txt (text/plain)
diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index 4b9e141404..2e07db2e6e 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -459,6 +459,44 @@ print_pathkeys(const List *pathkeys, const List *rtable)
printf(")\n");
}
+/*
+ * print_unique_key -
+ * unique_key an UniqueKey
+ */
+void
+print_unique_key(const UniqueKey *unique_key, const List *rtable)
+{
+ ListCell *l;
+
+ printf("(");
+ foreach(l, unique_key->eq_clauses)
+ {
+ EquivalenceClass *eclass = (EquivalenceClass *) lfirst(l);
+ ListCell *k;
+ bool first = true;
+
+ /* chase up */
+ while (eclass->ec_merged)
+ eclass = eclass->ec_merged;
+
+ printf("(");
+ foreach(k, eclass->ec_members)
+ {
+ EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
+
+ if (first)
+ first = false;
+ else
+ printf(", ");
+ print_expr((Node *) mem->em_expr, rtable);
+ }
+ printf(")");
+ if (lnext(l))
+ printf(", ");
+ }
+ printf(")\n");
+}
+
/*
* print_tl
* print targetlist in a more legible way.
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62132..8249a6b6db 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
- joinpath.o joinrels.o pathkeys.o tidpath.o
+ joinpath.o joinrels.o pathkeys.o tidpath.o uniquekey.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b7723481b0..a8c8fe8a30 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3957,6 +3957,14 @@ print_path(PlannerInfo *root, Path *path, int indent)
print_pathkeys(path->pathkeys, root->parse->rtable);
}
+ if (path->unique_key)
+ {
+ for (i = 0; i < indent; i++)
+ printf("\t");
+ printf(" unique_key: ");
+ print_unique_key(path->unique_key, root->parse->rtable);
+ }
+
if (join)
{
JoinPath *jp = (JoinPath *) path;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a2a9b1f7be..dbd0bbf3dc 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -705,6 +705,11 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
path->path.parallel_aware = true;
}
+ /* Consider cost based on unique key */
+ if (path->path.unique_key)
+ {
+ }
+
/*
* Now interpolate based on estimated index order correlation to get total
* disk I/O cost for main table accesses.
diff --git a/src/backend/optimizer/path/uniquekey.c b/src/backend/optimizer/path/uniquekey.c
new file mode 100644
index 0000000000..b4b9432ce5
--- /dev/null
+++ b/src/backend/optimizer/path/uniquekey.c
@@ -0,0 +1,64 @@
+/*-------------------------------------------------------------------------
+ *
+ * uniquekey.c
+ * Utilities for matching and building unique keys
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/optimizer/path/uniquekey.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "nodes/pg_list.h"
+
+/*
+ * Build a unique key for an index path, if its pathkeys match the query's distinct pathkeys
+ */
+UniqueKey*
+build_index_uniquekey(PlannerInfo *root, List *pathkeys)
+{
+ UniqueKey *unique_key = NULL;
+ ListCell *l;
+
+ if (pathkeys)
+ {
+ /* Find unique keys and add them to the list */
+ foreach(l, pathkeys)
+ {
+ ListCell *k;
+ PathKey *pk = (PathKey *) lfirst(l);
+ EquivalenceClass *ec = (EquivalenceClass *) pk->pk_eclass;
+
+ while (ec->ec_merged)
+ ec = ec->ec_merged;
+
+ if (root->distinct_pathkeys)
+ {
+ foreach(k, root->distinct_pathkeys)
+ {
+ PathKey *pk = (PathKey *) lfirst(k);
+ EquivalenceClass *dec = pk->pk_eclass;
+
+ while (dec->ec_merged)
+ dec = dec->ec_merged;
+
+ if (ec == dec)
+ {
+ if (!unique_key)
+ unique_key = makeNode(UniqueKey);
+
+ unique_key->eq_clauses = lappend(unique_key->eq_clauses, ec);
+ }
+ }
+ }
+ }
+ }
+
+ return unique_key;
+}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d884d2bb00..3f3aa6e57c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -966,6 +966,7 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = parallel_workers;
pathnode->pathkeys = NIL; /* seqscan has unordered result */
+ pathnode->unique_key = NULL;
cost_seqscan(pathnode, root, rel, pathnode->param_info);
@@ -990,6 +991,7 @@ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* samplescan has unordered result */
+ pathnode->unique_key = NULL;
cost_samplescan(pathnode, root, rel, pathnode->param_info);
@@ -1044,6 +1046,7 @@ create_index_path(PlannerInfo *root,
pathnode->path.parallel_safe = rel->consider_parallel;
pathnode->path.parallel_workers = 0;
pathnode->path.pathkeys = pathkeys;
+ pathnode->path.unique_key = build_index_uniquekey(root, pathkeys);
pathnode->indexinfo = index;
pathnode->indexclauses = indexclauses;
@@ -1087,6 +1090,7 @@ create_bitmap_heap_path(PlannerInfo *root,
pathnode->path.parallel_safe = rel->consider_parallel;
pathnode->path.parallel_workers = parallel_degree;
pathnode->path.pathkeys = NIL; /* always unordered */
+ pathnode->path.unique_key = NULL;
pathnode->bitmapqual = bitmapqual;
@@ -1947,6 +1951,7 @@ create_functionscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = pathkeys;
+ pathnode->unique_key = NULL;
cost_functionscan(pathnode, root, rel, pathnode->param_info);
@@ -1973,6 +1978,7 @@ create_tablefuncscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->unique_key = NULL;
cost_tablefuncscan(pathnode, root, rel, pathnode->param_info);
@@ -1999,6 +2005,7 @@ create_valuesscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->unique_key = NULL;
cost_valuesscan(pathnode, root, rel, pathnode->param_info);
@@ -2024,6 +2031,7 @@ create_ctescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* XXX for now, result is always unordered */
+ pathnode->unique_key = NULL;
cost_ctescan(pathnode, root, rel, pathnode->param_info);
@@ -2050,6 +2058,7 @@ create_namedtuplestorescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->unique_key = NULL;
cost_namedtuplestorescan(pathnode, root, rel, pathnode->param_info);
@@ -2076,6 +2085,7 @@ create_resultscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->unique_key = NULL;
cost_resultscan(pathnode, root, rel, pathnode->param_info);
@@ -2102,6 +2112,7 @@ create_worktablescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->unique_key = NULL;
/* Cost is the same as for a regular CTE scan */
cost_ctescan(pathnode, root, rel, pathnode->param_info);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 4e2fb39105..a9b67c64f8 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -261,6 +261,7 @@ typedef enum NodeTag
T_EquivalenceMember,
T_PathKey,
T_PathTarget,
+ T_UniqueKey,
T_RestrictInfo,
T_IndexClause,
T_PlaceHolderVar,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 441e64eca9..4ac6207705 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1071,6 +1071,15 @@ typedef struct ParamPathInfo
List *ppi_clauses; /* join clauses available from outer rels */
} ParamPathInfo;
+/*
+ * UniqueKey
+ */
+typedef struct UniqueKey
+{
+ NodeTag type;
+
+ List *eq_clauses; /* equivalence class */
+} UniqueKey;
/*
* Type "Path" is used as-is for sequential-scan paths, as well as some other
@@ -1100,6 +1109,9 @@ typedef struct ParamPathInfo
*
* "pathkeys" is a List of PathKey nodes (see above), describing the sort
* ordering of the path's output rows.
+ *
+ * "unique_key", if not NULL, is a UniqueKey node (see above),
+ * describing the XXX.
*/
typedef struct Path
{
@@ -1123,6 +1135,8 @@ typedef struct Path
List *pathkeys; /* sort ordering of path's output */
/* pathkeys is a List of PathKey nodes; see above */
+
+ UniqueKey *unique_key; /* the unique key, or NULL if none */
} Path;
/* Macro for extracting a path's parameterization relids; beware double eval */
diff --git a/src/include/nodes/print.h b/src/include/nodes/print.h
index cbff56a724..4ac359a50f 100644
--- a/src/include/nodes/print.h
+++ b/src/include/nodes/print.h
@@ -15,6 +15,7 @@
#define PRINT_H
#include "executor/tuptable.h"
+#include "nodes/pathnodes.h"
#define nodeDisplay(x) pprint(x)
@@ -28,6 +29,7 @@ extern char *pretty_format_node_dump(const char *dump);
extern void print_rt(const List *rtable);
extern void print_expr(const Node *expr, const List *rtable);
extern void print_pathkeys(const List *pathkeys, const List *rtable);
+extern void print_unique_key(const UniqueKey *unique_key, const List *rtable);
extern void print_tl(const List *tlist, const List *rtable);
extern void print_slot(TupleTableSlot *slot);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137d1d..f13d826717 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -235,4 +235,10 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *live_childrels);
+/*
+ * uniquekey.c
+ * Utilities for matching and building a unique key
+ */
+extern UniqueKey *build_index_uniquekey(PlannerInfo *root, List *pathkeys);
+
#endif /* PATHS_H */
On Wed, Jul 10, 2019 at 1:32 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
While updating the Loose Index Scan wiki page with links to other
products' terminology on this subject, I noticed that MySQL can
skip-scan MIN() and MAX() in the same query. Hmm. That seems quite
desirable. I think it requires a new kind of skipping: I think you
have to be able to skip to the first AND last key that has each
distinct prefix, and then stick a regular agg on top to collapse them
into one row. Such a path would not be so neatly describable by
UniqueKeys, or indeed by the amskip() interface in the current patch.
I mention all this stuff not because I want us to run before we can
walk, but because to be ready to commit the basic distinct skip scan
feature, I think we should know approximately how it'll handle the
future stuff we'll need.
Thomas, do you have any ideas for this ? I can see that MySQL implemented the
functionality in two change sets (base and function support), but like
you said we shouldn't paint ourselves into a corner.
I think amskip() could be augmented by later patches to take a
parameter 'skip mode' that can take values SKIP_FIRST, SKIP_LAST and
SKIP_FIRST | SKIP_LAST (meaning please give me both). What we have in
the current patch is SKIP_FIRST behaviour. Being able to choose
SKIP_FIRST or SKIP_LAST allows you to handle i, MAX(j) GROUP BY (i)
ORDER BY i (ie forward scan for the order, but wanting the highest key
for each distinct prefix), and being able to choose both allows for i,
MIN(j), MAX(j) GROUP BY i. Earlier I thought that this scheme that
allows you to ask for both might be incompatible with David's
suggestion of tracking UniqueKeys in paths, but now I see that it
isn't: it's just that SKIP_FIRST | SKIP_LAST would have no UniqueKeys
and therefore require a regular agg on top. I suspect that's OK: this
min/max transformation stuff is highly specialised and can have
whatever magic path-building logic is required in
preprocess_minmax_aggregates().
--
Thomas Munro
https://enterprisedb.com
Hi,
On 7/9/19 10:14 PM, Thomas Munro wrote:
Thomas, do you have any ideas for this ? I can see that MySQL implemented the
functionality in two change sets (base and function support), but like
you said we shouldn't paint ourselves into a corner.
I think amskip() could be augmented by later patches to take a
parameter 'skip mode' that can take values SKIP_FIRST, SKIP_LAST and
SKIP_FIRST | SKIP_LAST (meaning please give me both). What we have in
the current patch is SKIP_FIRST behaviour. Being able to choose
SKIP_FIRST or SKIP_LAST allows you to handle i, MAX(j) GROUP BY (i)
ORDER BY i (ie forward scan for the order, but wanting the highest key
for each distinct prefix), and being able to choose both allows for i,
MIN(j), MAX(j) GROUP BY i. Earlier I thought that this scheme that
allows you to ask for both might be incompatible with David's
suggestion of tracking UniqueKeys in paths, but now I see that it
isn't: it's just that SKIP_FIRST | SKIP_LAST would have no UniqueKeys
and therefore require a regular agg on top. I suspect that's OK: this
min/max transformation stuff is highly specialised and can have
whatever magic path-building logic is required in
preprocess_minmax_aggregates().
Ok, great.
Thanks for your feedback !
Best regards,
Jesper
Here is finally a new version of the patch, where all the mentioned issues
seem to be fixed and the corresponding new tests should keep it that way
(I've skipped all the pubs at PostgresLondon for that).
Thanks for the new patch! Really appreciate the work you're putting into it.
I verified that the backwards index scan is indeed functioning now. However, I'm afraid it's not that simple, as I think the cursor case is broken now. I think having just the 'scan direction' in the btree code is not enough to get this working properly, because we need to know whether we want the minimum or maximum element of a certain prefix. There are basically four cases:
- Forward Index Scan + Forward cursor: we want the minimum element within a prefix and we want to skip 'forward' to the next prefix
- Forward Index Scan + Backward cursor: we want the minimum element within a prefix and we want to skip 'backward' to the previous prefix
- Backward Index Scan + Forward cursor: we want the maximum element within a prefix and we want to skip 'backward' to the previous prefix
- Backward Index Scan + Backward cursor: we want the maximum element within a prefix and we want to skip 'forward' to the next prefix
These cases make it rather complicated unfortunately. They do somewhat tie in with the previous discussion on this thread about being able to skip to the min or max value. If we ever want to support a sort of minmax scan, we'll encounter the same issues.
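The four cases above reduce to two independent decisions: which extreme of each prefix we want depends only on the plan's scan direction, while the physical direction of the skip depends on whether the cursor fetch runs with or against that order. A toy sketch (the ScanDirection values here are stand-ins, not PostgreSQL's definitions, which also include NoMovementScanDirection):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for PostgreSQL's ScanDirection. */
typedef enum
{
	BackwardScanDirection = -1,
	ForwardScanDirection = 1
} ScanDirection;

/* We want the minimum element of each prefix iff the plan order is forward. */
static bool
want_prefix_minimum(ScanDirection indexorderdir)
{
	return indexorderdir == ForwardScanDirection;
}

/* The physical skip direction flips when the cursor fetches against the plan order. */
static ScanDirection
physical_skip_direction(ScanDirection indexorderdir, ScanDirection cursordir)
{
	return (indexorderdir == cursordir) ? ForwardScanDirection
										: BackwardScanDirection;
}
```

Checking these two helpers against the four bullet points above reproduces the min/max choice and the skip direction in each case.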
Also, I think in planner.c, line 4831, we should actually be looking at whether uniq_distinct_pathkeys is NIL, rather than the current check for distinct_pathkeys. That'll make the planner pick the skip scan even with queries like "select distinct on (a) a,b where a=2". Currently, it doesn't pick the skip scan here, because distinct_pathkeys does not contain "a" anymore. This leads to it scanning every item for a=2 even though it can stop after the first one.
I'll do some more tests with the patch.
-Floris
On Tue, Jul 2, 2019 at 2:27 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
The more I think about these UniqueKeys, the more I think they need to
be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
should be equivalent to { y, x }, but with PathKeys, that's not the
case, since the order of each key matters. UniqueKeys equivalent
version of pathkeys_contained_in() would not care about the order of
individual keys, it would say things like, { a, b, c } is contained in
{ b, a }, since if the path is unique on columns { b, a } then it must
also be unique on { a, b, c }.
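The set semantics David describes can be illustrated with a toy membership check over column ids (real UniqueKeys would hold EquivalenceClass pointers; this just shows the superset-not-prefix logic):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a set of unique-key columns as an array of column ids;
 * order is irrelevant, unlike PathKeys. */
static bool
contains(const int *set, int n, int col)
{
	for (int i = 0; i < n; i++)
		if (set[i] == col)
			return true;
	return false;
}

/*
 * True iff set1 is a superset of set2: if a path is unique on set2, it is
 * also unique on any superset set1 — so { a, b, c } is "contained in"
 * { b, a }.  A membership check, not a prefix check.
 */
static bool
uniquekeys_contained_in(const int *set1, int n1, const int *set2, int n2)
{
	for (int i = 0; i < n2; i++)
		if (!contains(set1, n1, set2[i]))
			return false;
	return true;
}
```

With columns a=1, b=2, c=3, the set {a,b,c} is contained in {b,a}, but not the other way around.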
On Tue, Jul 9, 2019 at 3:32 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
David, are you thinking about something like the attached ?
Some questions.
* Do you see UniqueKey as a "complete" planner node ?
- I didn't update the nodes/*.c files for this yet
* Is a UniqueKey with a list of EquivalenceClass best, or a list of
UniqueKey with a single EquivalenceClass
Just for me to clarify, the idea is to replace PathKeys with a new concept of
"UniqueKey" for skip scans, right? If I see it correctly, of course
UniqueKeys { x, y } == UniqueKeys { y, x }
from the result point of view, but the execution costs could be different due
to different values distribution. In fact there are efforts to utilize this to
produce more optimal order [1], but with UniqueKeys concept this information is
lost. Obviously it's not something that would be an immediate (or perhaps ever
a) problem for skip scan functionality, but I guess it's still worth pointing
it out.
On Wed, Jul 10, 2019 at 4:40 PM Floris Van Nee <florisvannee@optiver.com> wrote:
I verified that the backwards index scan is indeed functioning now. However,
I'm afraid it's not that simple, as I think the cursor case is broken now.
Thanks for testing! Could you provide a test case to show what exactly is the
problem?
[1]: /messages/by-id/7c79e6a5-8597-74e8-0671-1c39d124c9d6@sigaev.ru
Thanks for testing! Could you provide a test case to show what exactly is the
problem?
create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 10000) ks);
create index on a (a,b);
analyze a;
set enable_indexskipscan=1; -- setting this to 0 yields different results
set random_page_cost=1;
explain SELECT DISTINCT ON (a) a,b FROM a;
BEGIN;
DECLARE c SCROLL CURSOR FOR SELECT DISTINCT ON (a) a,b FROM a;
FETCH FROM c;
FETCH BACKWARD FROM c;
FETCH 6 FROM c;
FETCH BACKWARD 6 FROM c;
FETCH 6 FROM c;
FETCH BACKWARD 6 FROM c;
END;
On Wed, Jul 10, 2019 at 4:52 PM Floris Van Nee <florisvannee@optiver.com> wrote:
Thanks for testing! Could you provide a test case to show what exactly is the
problem?
create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 10000) ks);
create index on a (a,b);
analyze a;
set enable_indexskipscan=1; -- setting this to 0 yields different results
set random_page_cost=1;
explain SELECT DISTINCT ON (a) a,b FROM a;
BEGIN;
DECLARE c SCROLL CURSOR FOR SELECT DISTINCT ON (a) a,b FROM a;
FETCH FROM c;
FETCH BACKWARD FROM c;
FETCH 6 FROM c;
FETCH BACKWARD 6 FROM c;
FETCH 6 FROM c;
FETCH BACKWARD 6 FROM c;
END;
Ok, give me a moment, I'll check.
Thanks for testing! Could you provide a test case to show what exactly is the
problem?
Note that in the case of a regular non-skip scan, this backwards cursor fetch works because the Unique node on top does not support backwards scanning at all. Therefore, when creating the cursor, the actual plan contains a Materialize node on top of the Unique+Index Scan nodes. The 'fetch backwards' therefore never reaches the index scan, as it just fetches stuff from the materialize node.
-Floris
On Wed, Jul 10, 2019 at 5:00 PM Floris Van Nee <florisvannee@optiver.com> wrote:
Thanks for testing! Could you provide a test case to show what exactly is the
problem?
Note that in the case of a regular non-skip scan, this backwards cursor fetch
works because the Unique node on top does not support backwards scanning at
all. Therefore, when creating the cursor, the actual plan contains a
Materialize node on top of the Unique+Index Scan nodes. The 'fetch backwards'
therefore never reaches the index scan, as it just fetches stuff from the
materialize node.
Yeah, I'm aware. Last time I worked on cursors I managed to make this case
work as I wanted, so at that time I decided to keep it like that, even
though without skip scan it doesn't actually scan backwards.
On Thu, Jul 11, 2019 at 2:40 AM Floris Van Nee <florisvannee@optiver.com> wrote:
I verified that the backwards index scan is indeed functioning now. However, I'm afraid it's not that simple, as I think the cursor case is broken now. I think having just the 'scan direction' in the btree code is not enough to get this working properly, because we need to know whether we want the minimum or maximum element of a certain prefix. There are basically four cases:
- Forward Index Scan + Forward cursor: we want the minimum element within a prefix and we want to skip 'forward' to the next prefix
- Forward Index Scan + Backward cursor: we want the minimum element within a prefix and we want to skip 'backward' to the previous prefix
- Backward Index Scan + Forward cursor: we want the maximum element within a prefix and we want to skip 'backward' to the previous prefix
- Backward Index Scan + Backward cursor: we want the maximum element within a prefix and we want to skip 'forward' to the next prefix
These cases make it rather complicated unfortunately. They do somewhat tie in with the previous discussion on this thread about being able to skip to the min or max value. If we ever want to support a sort of minmax scan, we'll encounter the same issues.
Oh, right! So actually we already need the extra SKIP_FIRST/SKIP_LAST
argument to amskip() that I theorised about, to support DISTINCT ON.
Or I guess it could be modelled as SKIP_LOW/SKIP_HIGH or
SKIP_MIN/SKIP_MAX. If we don't add support for that, we'll have to
drop DISTINCT ON support, or use Materialize for some cases. My vote
is: let's move forwards and add that parameter and make DISTINCT ON
work.
--
Thomas Munro
https://enterprisedb.com
On Thu, 11 Jul 2019 at 14:50, Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Jul 11, 2019 at 2:40 AM Floris Van Nee <florisvannee@optiver.com> wrote:
I verified that the backwards index scan is indeed functioning now. However, I'm afraid it's not that simple, as I think the cursor case is broken now. I think having just the 'scan direction' in the btree code is not enough to get this working properly, because we need to know whether we want the minimum or maximum element of a certain prefix. There are basically four cases:
- Forward Index Scan + Forward cursor: we want the minimum element within a prefix and we want to skip 'forward' to the next prefix
- Forward Index Scan + Backward cursor: we want the minimum element within a prefix and we want to skip 'backward' to the previous prefix
- Backward Index Scan + Forward cursor: we want the maximum element within a prefix and we want to skip 'backward' to the previous prefix
- Backward Index Scan + Backward cursor: we want the maximum element within a prefix and we want to skip 'forward' to the next prefix
These cases make it rather complicated unfortunately. They do somewhat tie in with the previous discussion on this thread about being able to skip to the min or max value. If we ever want to support a sort of minmax scan, we'll encounter the same issues.
Oh, right! So actually we already need the extra SKIP_FIRST/SKIP_LAST
argument to amskip() that I theorised about, to support DISTINCT ON.
Or I guess it could be modelled as SKIP_LOW/SKIP_HIGH or
SKIP_MIN/SKIP_MAX. If we don't add support for that, we'll have to
drop DISTINCT ON support, or use Materialize for some cases. My vote
is: let's move forwards and add that parameter and make DISTINCT ON
work.
Does it not just need to know the current direction of the cursor's
scroll, then also the intended scan direction?
For the general forward direction but for a backwards cursor scroll,
we'd return the lowest value for each distinct prefix, but for the
general backwards direction (DESC case) we'd return the highest value
for each distinct prefix. Looking at IndexNext() the cursor direction
seems to be estate->es_direction and the general scan direction is
indicated by the plan's indexorderdir. Can't we just pass both of
those to index_skip() to have it decide what to do? If we also pass in
indexorderdir then index_skip() should know if it's to return the
highest or lowest value, right?
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
For the general forward direction but for a backwards cursor scroll,
we'd return the lowest value for each distinct prefix, but for the
general backwards direction (DESC case) we'd return the highest value
for each distinct prefix. Looking at IndexNext() the cursor direction
seems to be estate->es_direction and the general scan direction is
indicated by the plan's indexorderdir. Can't we just pass both of
those to index_skip() to have it decide what to do? If we also pass in
indexorderdir then index_skip() should know if it's to return the
highest or lowest value, right?
Correct, with these two values correct behavior can be deduced. The implementation of this is a bit cumbersome though. Consider a case like:
SELECT DISTINCT ON (a) a,b,c FROM a WHERE c = 2 (with an index on a,b,c)
Data (imagine every tuple here actually occurs 10,000 times in the index to see the benefit of skipping):
1,1,1
1,1,2
1,2,2
1,2,3
2,2,1
2,2,3
3,1,1
3,1,2
3,2,2
3,2,3
Creating a cursor on this query and then moving forward, you should get (1,1,2), (3,1,2). In the current implementation of the patch, after bt_first, it skips over (1,1,2) to (2,2,1). It checks quals and moves forward one-by-one until it finds a match. This match only comes at (3,1,2) however. Then it skips to the end.
If you move the cursor backwards from the end of the cursor, you should still get (3,1,2) (1,1,2). A possible implementation would start at the end and do a skip to the beginning of the prefix: (3,1,1). Then it needs to move forward one-by-one in order to find the first matching (minimum) item (3,1,2). When it finds it, it needs to skip backwards to the beginning of prefix 2 (2,2,1). It needs to move forwards to find the minimum element, but should stop as soon as it detects that the prefix doesn't match anymore (because there is no match for prefix 2, it will move all the way from (2,2,1) to (3,1,1)). It then needs to skip backwards again to the start of prefix 1: (1,1,1) and scan forward to find (1,1,2).
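As a toy illustration of the backward-fetch procedure just described — walk prefixes in descending order, and within each prefix scan forward from its start for the first qualifying (minimum) tuple — here is a plain-C simulation over the example data. The real implementation would work in terms of B-tree pages and amskip() calls, not an in-memory array:

```c
#include <assert.h>

typedef struct { int a, b, c; } Tup;

/* Index-ordered example data (each row stands in for many duplicates). */
static const Tup idx[] = {
	{1,1,1},{1,1,2},{1,2,2},{1,2,3},
	{2,2,1},{2,2,3},
	{3,1,1},{3,1,2},{3,2,2},{3,2,3},
};
static const int nidx = sizeof(idx) / sizeof(idx[0]);

/*
 * Backward fetch for DISTINCT ON (a) ... WHERE c = 2: for each prefix of
 * column a, taken in descending order, skip backward to the prefix start,
 * then scan forward and emit the first qualifying tuple (if any).
 * Returns the number of results written to out[] in backward-fetch order.
 */
static int
fetch_backward_distinct_on(const Tup *t, int nt, Tup *out)
{
	int		nout = 0;
	int		end = nt;			/* exclusive end of current prefix */

	while (end > 0)
	{
		int		start = end - 1;

		while (start > 0 && t[start - 1].a == t[end - 1].a)
			start--;			/* skip backward to start of prefix */

		for (int i = start; i < end; i++)
		{
			if (t[i].c == 2)	/* first (minimum) matching tuple */
			{
				out[nout++] = t[i];
				break;
			}
		}
		end = start;			/* move to the previous prefix */
	}
	return nout;
}
```

Running this over the example yields (3,1,2) then (1,1,2), matching the expected backward-cursor output; note the wasted forward scan through the whole a=2 prefix, which is exactly the cost discussed above.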
Perhaps anyone can think of an easier way to implement it?
I do think being able to use DISTINCT ON is very useful and it's worth the extra complications. In the future we can add even more useful skipping features to it, for example:
SELECT DISTINCT ON (a) * FROM a WHERE b =2
After skipping to the next prefix of column a, we can start a new search for (a,b)=(prefix,2) to avoid having to move one-by-one from the start of the prefix to the first matching element. There are many other useful optimizations possible. That won't have to be for this patch though :-)
-Floris
On Thu, 11 Jul 2019 at 19:41, Floris Van Nee <florisvannee@optiver.com> wrote:
SELECT DISTINCT ON (a) a,b,c FROM a WHERE c = 2 (with an index on a,b,c)
Data (imagine every tuple here actually occurs 10,000 times in the index to see the benefit of skipping):
1,1,1
1,1,2
1,2,2
1,2,3
2,2,1
2,2,3
3,1,1
3,1,2
3,2,2
3,2,3
Creating a cursor on this query and then moving forward, you should get (1,1,2), (3,1,2). In the current implementation of the patch, after bt_first, it skips over (1,1,2) to (2,2,1). It checks quals and moves forward one-by-one until it finds a match. This match only comes at (3,1,2) however. Then it skips to the end.
If you move the cursor backwards from the end of the cursor, you should still get (3,1,2) (1,1,2). A possible implementation would start at the end and do a skip to the beginning of the prefix: (3,1,1). Then it needs to move forward one-by-one in order to find the first matching (minimum) item (3,1,2). When it finds it, it needs to skip backwards to the beginning of prefix 2 (2,2,1). It needs to move forwards to find the minimum element, but should stop as soon as it detects that the prefix doesn't match anymore (because there is no match for prefix 2, it will move all the way from (2,2,1) to (3,1,1)). It then needs to skip backwards again to the start of prefix 1: (1,1,1) and scan forward to find (1,1,2).
Perhaps anyone can think of an easier way to implement it?
One option is to just not implement it and instead change
ExecSupportsBackwardScan() so that it returns false for skip index
scans, or perhaps at least implement an index am method to allow the
planner to be able to determine if the index implementation supports
it... amcanskipbackward
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, 11 Jul 2019 at 02:48, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
On Tue, Jul 2, 2019 at 2:27 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
The more I think about these UniqueKeys, the more I think they need to
be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
should be equivalent to { y, x }, but with PathKeys, that's not the
case, since the order of each key matters. UniqueKeys equivalent
version of pathkeys_contained_in() would not care about the order of
individual keys, it would say things like, { a, b, c } is contained in
{ b, a }, since if the path is unique on columns { b, a } then it must
also be unique on { a, b, c }.
On Tue, Jul 9, 2019 at 3:32 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
David, are you thinking about something like the attached ?
Some questions.
* Do you see UniqueKey as a "complete" planner node ?
- I didn't update the nodes/*.c files for this yet
* Is a UniqueKey with a list of EquivalenceClass best, or a list of
UniqueKey with a single EquivalenceClass
Just for me to clarify, the idea is to replace PathKeys with a new concept of
"UniqueKey" for skip scans, right? If I see it correctly, of course
UniqueKeys { x, y } == UniqueKeys { y, x }
from the result point of view, but the execution costs could be different due
to different values distribution. In fact there are efforts to utilize this to
produce more optimal order [1], but with UniqueKeys concept this information is
lost. Obviously it's not something that would be an immediate (or perhaps ever
a) problem for skip scan functionality, but I guess it's still worth pointing
it out.
The UniqueKeys idea is quite separate from pathkeys. Currently, a
Path can have a List of PathKeys which define the order that the
tuples will be read from the Plan node that's created from that Path.
The idea with UniqueKeys is that a Path can also have a non-empty List
of UniqueKeys to define that there will be no more than 1 row with the
same value for the Path's UniqueKey column/exprs.
As of now, if you look at how standard_qp_callback() sets
root->query_pathkeys, the idea here would be that the callback would
also set a new List field named "query_uniquekeys" based on the
group_pathkeys when non-empty and !root->query->hasAggs, or by using
the distinct clause if it's non-empty. Then in build_index_paths()
around the call to match_pathkeys_to_index() we'll probably also want
to check if the index can support UniqueKeys that would suit the
query_uniquekeys that were set during standard_qp_callback().
As for setting query_uniquekeys in standard_qp_callback(), this should
be simply a matter of looping over either group_pathkeys or
distinct_pathkeys and grabbing the pk_eclass from each key and making
a canonical UniqueKey from that. To have these canonical you'll need
to have a new field in PlannerInfo named canon_uniquekeys which will
do for UniqueKeys what canon_pathkeys does for PathKeys. So you'll
need an equivalent of make_canonical_pathkey() in uniquekey.c
With this implementation, the code that the patch adds in
create_distinct_paths() can mostly disappear. You'd only need to look
for a path that uniquekeys_contained_in() matches the
root->query_uniquekeys and then just leave it to
set_cheapest(distinct_rel); to pick the cheapest path.
It would be wasted effort to create paths with UniqueKeys if there's
multiple non-dead base rels in the query as the final rel in
create_distinct_paths would be a join rel, so it might be worth
checking that before creating such index paths in build_index_paths().
However, down the line, we could carry the uniquekeys forward into the
join paths if the join does not duplicate rows from the other side of
the join... That's future stuff though, not for this patch, I don't
think.
I think a UniqueKey can just be a struct similar to PathKey, e.g. be
located in pathnodes.h around where PathKey is defined. Likely we'll
need a uniquekeys.c file that has the equivalent to
pathkeys_contained_in() ... uniquekeys_contained_in(), which returns
true if arg1 is a superset of arg2 rather than check for one set being
a prefix of another. As you mention above: UniqueKeys { x, y } ==
UniqueKeys { y, x }. That superset check could perhaps be optimized
by sorting UniqueKey lists in memory address order, that'll save
having a nested loop, but likely that's not going to be required for a
first cut version. This would work since you'd want UniqueKeys to be
canonical the same as PathKeys are (Notice that compare_pathkeys()
only checks memory addresses of pathkeys and not equals()).
I think the UniqueKey struct would only need to contain an
EquivalenceClass field. I think all the other stuff that's in PathKey
is irrelevant to UniqueKey.
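The canonical-key scheme David describes — an equivalent of make_canonical_pathkey() so that keys can be compared by memory address — can be sketched with toy stand-in types (a fixed-size intern table here; the real code would hang the list off PlannerInfo->canon_uniquekeys):

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the planner structures. */
typedef struct EquivalenceClass { int id; } EquivalenceClass;
typedef struct UniqueKey { EquivalenceClass *eq_clause; } UniqueKey;

#define MAX_CANON 8
static UniqueKey canon[MAX_CANON];
static int	ncanon = 0;

/*
 * Return the canonical UniqueKey for an EquivalenceClass.  Because every
 * caller gets the same pointer for the same class, lists of UniqueKeys can
 * be compared by address, as compare_pathkeys() does for PathKeys.
 */
static UniqueKey *
make_canonical_uniquekey(EquivalenceClass *ec)
{
	for (int i = 0; i < ncanon; i++)
		if (canon[i].eq_clause == ec)
			return &canon[i];	/* already canonical: reuse */

	assert(ncanon < MAX_CANON);
	canon[ncanon].eq_clause = ec;
	return &canon[ncanon++];
}
```

Calling it twice with the same equivalence class yields the same pointer, which is what makes the address-based superset check cheap.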
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi David,
On 7/11/19 7:38 AM, David Rowley wrote:
The UniqueKeys idea is quite separate from pathkeys. Currently, a
Path can have a List of PathKeys which define the order that the
tuples will be read from the Plan node that's created from that Path.
The idea with UniqueKeys is that a Path can also have a non-empty List
of UniqueKeys to define that there will be no more than 1 row with the
same value for the Paths UniqueKey column/exprs.
As of now, if you look at standard_qp_callback() sets
root->query_pathkeys, the idea here would be that the callback would
also set a new List field named "query_uniquekeys" based on the
group_pathkeys when non-empty and !root->query->hasAggs, or by using
the distinct clause if it's non-empty. Then in build_index_paths()
around the call to match_pathkeys_to_index() we'll probably also want
to check if the index can support UniqueKeys that would suit the
query_uniquekeys that were set during standard_qp_callback().
As for setting query_uniquekeys in standard_qp_callback(), this should
be simply a matter of looping over either group_pathkeys or
distinct_pathkeys and grabbing the pk_eclass from each key and making
a canonical UniqueKey from that. To have these canonical you'll need
to have a new field in PlannerInfo named canon_uniquekeys which will
do for UniqueKeys what canon_pathkeys does for PathKeys. So you'll
need an equivalent of make_canonical_pathkey() in uniquekey.c
With this implementation, the code that the patch adds in
create_distinct_paths() can mostly disappear. You'd only need to look
for a path that uniquekeys_contained_in() matches the
root->query_uniquekeys and then just leave it to
set_cheapest(distinct_rel); to pick the cheapest path.
It would be wasted effort to create paths with UniqueKeys if there's
multiple non-dead base rels in the query as the final rel in
create_distinct_paths would be a join rel, so it might be worth
checking that before creating such index paths in build_index_paths().
However, down the line, we could carry the uniquekeys forward into the
join paths if the join does not duplicate rows from the other side of
the join... That's future stuff though, not for this patch, I don't
think.
I think a UniqueKey can just be a struct similar to PathKey, e.g be
located in pathnodes.h around where PathKey is defined. Likely we'll
need a uniquekeys.c file that has the equivalent to
pathkeys_contained_in() ... uniquekeys_contained_in(), which returns
true if arg1 is a superset of arg2 rather than check for one set being
a prefix of another. As you mention above: UniqueKeys { x, y } ==
UniqueKeys { y, x }. That superset check could perhaps be optimized
by sorting UniqueKey lists in memory address order, that'll save
having a nested loop, but likely that's not going to be required for a
first cut version. This would work since you'd want UniqueKeys to be
canonical the same as PathKeys are (Notice that compare_pathkeys()
only checks memory addresses of pathkeys and not equals()).
I think the UniqueKey struct would only need to contain an
EquivalenceClass field. I think all the other stuff that's in PathKey
is irrelevant to UniqueKey.
Thanks for the feedback ! I'll work on these changes for the next
uniquekey patch.
Best regards,
Jesper
Hi David,
On 7/11/19 7:38 AM, David Rowley wrote:
The UniqueKeys idea is quite separate from pathkeys. Currently, a
Path can have a List of PathKeys which define the order that the
tuples will be read from the Plan node that's created from that Path.
The idea with UniqueKeys is that a Path can also have a non-empty List
of UniqueKeys to define that there will be no more than 1 row with the
same value for the Paths UniqueKey column/exprs.
As of now, if you look at standard_qp_callback() sets
root->query_pathkeys, the idea here would be that the callback would
also set a new List field named "query_uniquekeys" based on the
group_pathkeys when non-empty and !root->query->hasAggs, or by using
the distinct clause if it's non-empty. Then in build_index_paths()
around the call to match_pathkeys_to_index() we'll probably also want
to check if the index can support UniqueKeys that would suit the
query_uniquekeys that were set during standard_qp_callback().
As for setting query_uniquekeys in standard_qp_callback(), this should
be simply a matter of looping over either group_pathkeys or
distinct_pathkeys and grabbing the pk_eclass from each key and making
a canonical UniqueKey from that. To have these canonical you'll need
to have a new field in PlannerInfo named canon_uniquekeys which will
do for UniqueKeys what canon_pathkeys does for PathKeys. So you'll
need an equivalent of make_canonical_pathkey() in uniquekey.c
With this implementation, the code that the patch adds in
create_distinct_paths() can mostly disappear. You'd only need to look
for a path that uniquekeys_contained_in() matches the
root->query_uniquekeys and then just leave it to
set_cheapest(distinct_rel); to pick the cheapest path.
It would be wasted effort to create paths with UniqueKeys if there's
multiple non-dead base rels in the query as the final rel in
create_distinct_paths would be a join rel, so it might be worth
checking that before creating such index paths in build_index_paths().
However, down the line, we could carry the uniquekeys forward into the
join paths if the join does not duplicate rows from the other side of
the join... That's future stuff though, not for this patch, I don't
think.
I think a UniqueKey can just be a struct similar to PathKey, e.g be
located in pathnodes.h around where PathKey is defined. Likely we'll
need a uniquekeys.c file that has the equivalent to
pathkeys_contained_in() ... uniquekeys_contained_in(), which returns
true if arg1 is a superset of arg2 rather than check for one set being
a prefix of another. As you mention above: UniqueKeys { x, y } ==
UniqueKeys { y, x }. That superset check could perhaps be optimized
by sorting UniqueKey lists in memory address order, that'll save
having a nested loop, but likely that's not going to be required for a
first cut version. This would work since you'd want UniqueKeys to be
canonical the same as PathKeys are (Notice that compare_pathkeys()
only checks memory addresses of pathkeys and not equals()).
I think the UniqueKey struct would only need to contain an
EquivalenceClass field. I think all the other stuff that's in PathKey
is irrelevant to UniqueKey.
Here is a patch more in that direction.
Some questions:
1) Do we really need the UniqueKey struct ? If it only contains the
EquivalenceClass field then we could just have a list of those instead.
That would make the patch simpler.
2) Do we need both canon_uniquekeys and query_uniquekeys ? Currently
the patch only uses canon_uniquekeys because the we attach the list
directly on the Path node.
I'll clean the patch up based on your feedback, and then start to rebase
v21 on it.
Thanks in advance !
Best regards,
Jesper
Attachments:
v2_uniquekey.txt (text/plain)
From 174a6425036e2d4ca7d3d68c635cd55a58a9b9e6 Mon Sep 17 00:00:00 2001
From: jesperpedersen <jesper.pedersen@redhat.com>
Date: Tue, 9 Jul 2019 06:44:57 -0400
Subject: [PATCH] UniqueKey
---
src/backend/nodes/print.c | 39 +++++++
src/backend/optimizer/path/Makefile | 2 +-
src/backend/optimizer/path/allpaths.c | 8 ++
src/backend/optimizer/path/costsize.c | 5 +
src/backend/optimizer/path/indxpath.c | 39 +++++++
src/backend/optimizer/path/uniquekey.c | 149 +++++++++++++++++++++++++
src/backend/optimizer/plan/planner.c | 12 +-
src/backend/optimizer/util/pathnode.c | 12 ++
src/include/nodes/nodes.h | 1 +
src/include/nodes/pathnodes.h | 18 +++
src/include/nodes/print.h | 2 +-
src/include/optimizer/pathnode.h | 1 +
src/include/optimizer/paths.h | 8 ++
13 files changed, 293 insertions(+), 3 deletions(-)
create mode 100644 src/backend/optimizer/path/uniquekey.c
diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index 4ecde6b421..ed5684bf19 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable)
printf(")\n");
}
+/*
+ * print_unique_keys -
+ *   print a list of UniqueKeys
+ */
+void
+print_unique_keys(const List *unique_keys, const List *rtable)
+{
+ ListCell *l;
+
+ printf("(");
+ foreach(l, unique_keys)
+ {
+ UniqueKey *unique_key = (UniqueKey *) lfirst(l);
+ EquivalenceClass *eclass = (EquivalenceClass *) unique_key->eq_clause;
+ ListCell *k;
+ bool first = true;
+
+ /* chase up */
+ while (eclass->ec_merged)
+ eclass = eclass->ec_merged;
+
+ printf("(");
+ foreach(k, eclass->ec_members)
+ {
+ EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
+
+ if (first)
+ first = false;
+ else
+ printf(", ");
+ print_expr((Node *) mem->em_expr, rtable);
+ }
+ printf(")");
+ if (lnext(unique_keys, l))
+ printf(", ");
+ }
+ printf(")\n");
+}
+
/*
* print_tl
* print targetlist in a more legible way.
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62132..8249a6b6db 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
- joinpath.o joinrels.o pathkeys.o tidpath.o
+ joinpath.o joinrels.o pathkeys.o tidpath.o uniquekey.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index e9ee32b7f4..acd22653c2 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3957,6 +3957,14 @@ print_path(PlannerInfo *root, Path *path, int indent)
print_pathkeys(path->pathkeys, root->parse->rtable);
}
+ if (path->unique_keys)
+ {
+ for (i = 0; i < indent; i++)
+ printf("\t");
+ printf(" unique_keys: ");
+ print_unique_keys(path->unique_keys, root->parse->rtable);
+ }
+
if (join)
{
JoinPath *jp = (JoinPath *) path;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 3a9a994733..62d7815a76 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -705,6 +705,11 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
path->path.parallel_aware = true;
}
+ /* Consider cost based on unique key */
+ if (path->path.unique_keys)
+ {
+ }
+
/*
* Now interpolate based on estimated index order correlation to get total
* disk I/O cost for main table accesses.
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 5f339fdfde..f053ee6794 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -189,6 +189,7 @@ static Expr *match_clause_to_ordering_op(IndexOptInfo *index,
static bool ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
+static List *get_uniquekeys_for_index(PlannerInfo *root, List *pathkeys);
/*
@@ -874,6 +875,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
List *orderbyclausecols;
List *index_pathkeys;
List *useful_pathkeys;
+ List *useful_uniquekeys;
bool found_lower_saop_clause;
bool pathkeys_possibly_useful;
bool index_is_ordered;
@@ -1036,11 +1038,14 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
if (index_clauses != NIL || useful_pathkeys != NIL || useful_predicate ||
index_only_scan)
{
+ useful_uniquekeys = get_uniquekeys_for_index(root, useful_pathkeys);
+
ipath = create_index_path(root, index,
index_clauses,
orderbyclauses,
orderbyclausecols,
useful_pathkeys,
+ useful_uniquekeys,
index_is_ordered ?
ForwardScanDirection :
NoMovementScanDirection,
@@ -1063,6 +1068,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
orderbyclauses,
orderbyclausecols,
useful_pathkeys,
+ useful_uniquekeys,
index_is_ordered ?
ForwardScanDirection :
NoMovementScanDirection,
@@ -1093,11 +1099,14 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
index_pathkeys);
if (useful_pathkeys != NIL)
{
+ useful_uniquekeys = get_uniquekeys_for_index(root, useful_pathkeys);
+
ipath = create_index_path(root, index,
index_clauses,
NIL,
NIL,
useful_pathkeys,
+ useful_uniquekeys,
BackwardScanDirection,
index_only_scan,
outer_relids,
@@ -1115,6 +1124,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
NIL,
NIL,
useful_pathkeys,
+ useful_uniquekeys,
BackwardScanDirection,
index_only_scan,
outer_relids,
@@ -3369,6 +3379,35 @@ match_clause_to_ordering_op(IndexOptInfo *index,
return clause;
}
+/*
+ * get_uniquekeys_for_index
+ */
+static List *
+get_uniquekeys_for_index(PlannerInfo *root, List *pathkeys)
+{
+ ListCell *lc;
+
+ if (pathkeys)
+ {
+ List *uniquekeys = NIL;
+ foreach(lc, pathkeys)
+ {
+ UniqueKey *unique_key;
+ PathKey *pk = (PathKey *) lfirst(lc);
+ EquivalenceClass *ec = (EquivalenceClass *) pk->pk_eclass;
+
+ unique_key = makeNode(UniqueKey);
+ unique_key->eq_clause = ec;
+
+ uniquekeys = lappend(uniquekeys, unique_key);
+ }
+
+ if (uniquekeys_contained_in(root->canon_uniquekeys, uniquekeys))
+ return uniquekeys;
+ }
+
+ return NIL;
+}
/****************************************************************************
* ---- ROUTINES TO DO PARTIAL INDEX PREDICATE TESTS ----
diff --git a/src/backend/optimizer/path/uniquekey.c b/src/backend/optimizer/path/uniquekey.c
new file mode 100644
index 0000000000..9eecaef56b
--- /dev/null
+++ b/src/backend/optimizer/path/uniquekey.c
@@ -0,0 +1,149 @@
+/*-------------------------------------------------------------------------
+ *
+ * uniquekey.c
+ * Utilities for matching and building unique keys
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/optimizer/path/uniquekey.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "nodes/pg_list.h"
+
+static UniqueKey *make_canonical_uniquekey(PlannerInfo *root, EquivalenceClass *eclass);
+static List* build_uniquekeys(PlannerInfo *root, List *pathkeys);
+
+/*
+ * Build unique keys for GROUP BY
+ */
+List*
+build_group_uniquekeys(PlannerInfo *root)
+{
+ return build_uniquekeys(root, root->group_pathkeys);
+}
+
+/*
+ * Build unique keys for DISTINCT
+ */
+List*
+build_distinct_uniquekeys(PlannerInfo *root)
+{
+ return build_uniquekeys(root, root->distinct_pathkeys);
+}
+
+/*
+ * uniquekeys_contained_in
+ * Check whether keys2 is a subset of keys1
+ */
+bool
+uniquekeys_contained_in(List *keys1, List *keys2)
+{
+ ListCell *key1,
+ *key2;
+
+ /*
+ * Fall out quickly if we are passed two identical lists. This mostly
+ * catches the case where both are NIL, but that's common enough to
+ * warrant the test.
+ */
+ if (keys1 == keys2)
+ return true;
+
+ foreach(key2, keys2)
+ {
+ bool found = false;
+ UniqueKey *uniquekey2 = (UniqueKey *) lfirst(key2);
+
+ foreach(key1, keys1)
+ {
+ UniqueKey *uniquekey1 = (UniqueKey *) lfirst(key1);
+
+ if (uniquekey1->eq_clause == uniquekey2->eq_clause)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * make_canonical_uniquekey
+ * Given the parameters for a UniqueKey, find any pre-existing matching
+ * uniquekey in the query's list of "canonical" uniquekeys. Make a new
+ * entry if there's not one already.
+ *
+ * Note that this function must not be used until after we have completed
+ * merging EquivalenceClasses. (We don't try to enforce that here; instead,
+ * equivclass.c will complain if a merge occurs after root->canon_uniquekeys
+ * has become nonempty.)
+ */
+static UniqueKey *
+make_canonical_uniquekey(PlannerInfo *root,
+ EquivalenceClass *eclass)
+{
+ UniqueKey *uk;
+ ListCell *lc;
+ MemoryContext oldcontext;
+
+ /* The passed eclass might be non-canonical, so chase up to the top */
+ while (eclass->ec_merged)
+ eclass = eclass->ec_merged;
+
+ foreach(lc, root->canon_uniquekeys)
+ {
+ uk = (UniqueKey *) lfirst(lc);
+ if (eclass == uk->eq_clause)
+ return uk;
+ }
+
+ /*
+ * Be sure canonical uniquekeys are allocated in the main planning context.
+ * Not an issue in normal planning, but it is for GEQO.
+ */
+ oldcontext = MemoryContextSwitchTo(root->planner_cxt);
+
+ uk = makeNode(UniqueKey);
+ uk->eq_clause = eclass;
+
+ root->canon_uniquekeys = lappend(root->canon_uniquekeys, uk);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return uk;
+}
+
+/*
+ * Build a list of unique keys
+ */
+static List*
+build_uniquekeys(PlannerInfo *root, List *pathkeys)
+{
+ List *result = NIL;
+ ListCell *l;
+
+ /* Create a uniquekey and add it to the list */
+ foreach(l, pathkeys)
+ {
+ UniqueKey *unique_key;
+ PathKey *pk = (PathKey *) lfirst(l);
+ EquivalenceClass *ec = (EquivalenceClass *) pk->pk_eclass;
+
+ unique_key = make_canonical_uniquekey(root, ec);
+ result = lappend(result, unique_key);
+ }
+
+ return result;
+}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index ca3b7f29e1..dab3142e51 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3650,13 +3650,23 @@ standard_qp_callback(PlannerInfo *root, void *extra)
* much easier, since we know that the parser ensured that one is a
* superset of the other.
*/
+ root->query_uniquekeys = NIL;
+
if (root->group_pathkeys)
+ {
root->query_pathkeys = root->group_pathkeys;
+
+ if (!root->parse->hasAggs)
+ root->query_uniquekeys = build_group_uniquekeys(root);
+ }
else if (root->window_pathkeys)
root->query_pathkeys = root->window_pathkeys;
else if (list_length(root->distinct_pathkeys) >
list_length(root->sort_pathkeys))
+ {
root->query_pathkeys = root->distinct_pathkeys;
+ root->query_uniquekeys = build_distinct_uniquekeys(root);
+ }
else if (root->sort_pathkeys)
root->query_pathkeys = root->sort_pathkeys;
else
@@ -6216,7 +6226,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
/* Estimate the cost of index scan */
indexScanPath = create_index_path(root, indexInfo,
- NIL, NIL, NIL, NIL,
+ NIL, NIL, NIL, NIL, NIL,
ForwardScanDirection, false,
NULL, 1.0, false);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 0ac73984d2..13766e0bd1 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -941,6 +941,7 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = parallel_workers;
pathnode->pathkeys = NIL; /* seqscan has unordered result */
+ pathnode->unique_keys = NIL;
cost_seqscan(pathnode, root, rel, pathnode->param_info);
@@ -965,6 +966,7 @@ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* samplescan has unordered result */
+ pathnode->unique_keys = NIL;
cost_samplescan(pathnode, root, rel, pathnode->param_info);
@@ -1001,6 +1003,7 @@ create_index_path(PlannerInfo *root,
List *indexorderbys,
List *indexorderbycols,
List *pathkeys,
+ List *uniquekeys,
ScanDirection indexscandir,
bool indexonly,
Relids required_outer,
@@ -1019,6 +1022,7 @@ create_index_path(PlannerInfo *root,
pathnode->path.parallel_safe = rel->consider_parallel;
pathnode->path.parallel_workers = 0;
pathnode->path.pathkeys = pathkeys;
+ pathnode->path.unique_keys = uniquekeys;
pathnode->indexinfo = index;
pathnode->indexclauses = indexclauses;
@@ -1062,6 +1066,7 @@ create_bitmap_heap_path(PlannerInfo *root,
pathnode->path.parallel_safe = rel->consider_parallel;
pathnode->path.parallel_workers = parallel_degree;
pathnode->path.pathkeys = NIL; /* always unordered */
+ pathnode->path.unique_keys = NIL;
pathnode->bitmapqual = bitmapqual;
@@ -1923,6 +1928,7 @@ create_functionscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = pathkeys;
+ pathnode->unique_keys = NIL;
cost_functionscan(pathnode, root, rel, pathnode->param_info);
@@ -1949,6 +1955,7 @@ create_tablefuncscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->unique_keys = NIL;
cost_tablefuncscan(pathnode, root, rel, pathnode->param_info);
@@ -1975,6 +1982,7 @@ create_valuesscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->unique_keys = NIL;
cost_valuesscan(pathnode, root, rel, pathnode->param_info);
@@ -2000,6 +2008,7 @@ create_ctescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* XXX for now, result is always unordered */
+ pathnode->unique_keys = NIL;
cost_ctescan(pathnode, root, rel, pathnode->param_info);
@@ -2026,6 +2035,7 @@ create_namedtuplestorescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->unique_keys = NIL;
cost_namedtuplestorescan(pathnode, root, rel, pathnode->param_info);
@@ -2052,6 +2062,7 @@ create_resultscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->unique_keys = NIL;
cost_resultscan(pathnode, root, rel, pathnode->param_info);
@@ -2078,6 +2089,7 @@ create_worktablescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->unique_keys = NIL;
/* Cost is the same as for a regular CTE scan */
cost_ctescan(pathnode, root, rel, pathnode->param_info);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 4e2fb39105..a9b67c64f8 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -261,6 +261,7 @@ typedef enum NodeTag
T_EquivalenceMember,
T_PathKey,
T_PathTarget,
+ T_UniqueKey,
T_RestrictInfo,
T_IndexClause,
T_PlaceHolderVar,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 441e64eca9..485986a61a 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -267,6 +267,8 @@ struct PlannerInfo
List *canon_pathkeys; /* list of "canonical" PathKeys */
+ List *canon_uniquekeys; /* list of "canonical" UniqueKeys */
+
List *left_join_clauses; /* list of RestrictInfos for mergejoinable
* outer join clauses w/nonnullable var on
* left */
@@ -295,6 +297,8 @@ struct PlannerInfo
List *query_pathkeys; /* desired pathkeys for query_planner() */
+ List *query_uniquekeys; /* desired unique keys for query_planner() */
+
List *group_pathkeys; /* groupClause pathkeys, if any */
List *window_pathkeys; /* pathkeys of bottom window, if any */
List *distinct_pathkeys; /* distinctClause pathkeys, if any */
@@ -1071,6 +1075,15 @@ typedef struct ParamPathInfo
List *ppi_clauses; /* join clauses available from outer rels */
} ParamPathInfo;
+/*
+ * UniqueKey
+ */
+typedef struct UniqueKey
+{
+ NodeTag type;
+
+ EquivalenceClass *eq_clause; /* equivalence class */
+} UniqueKey;
/*
* Type "Path" is used as-is for sequential-scan paths, as well as some other
@@ -1100,6 +1113,9 @@ typedef struct ParamPathInfo
*
* "pathkeys" is a List of PathKey nodes (see above), describing the sort
* ordering of the path's output rows.
+ *
+ * "unique_keys", if not NIL, is a list of UniqueKey nodes (see above),
+ * describing the keys on which the path's output rows are known to be unique.
*/
typedef struct Path
{
@@ -1123,6 +1139,8 @@ typedef struct Path
List *pathkeys; /* sort ordering of path's output */
/* pathkeys is a List of PathKey nodes; see above */
+
+ List *unique_keys; /* the unique keys, or NIL if none */
} Path;
/* Macro for extracting a path's parameterization relids; beware double eval */
diff --git a/src/include/nodes/print.h b/src/include/nodes/print.h
index cbff56a724..196d3a0783 100644
--- a/src/include/nodes/print.h
+++ b/src/include/nodes/print.h
@@ -16,7 +16,6 @@
#include "executor/tuptable.h"
-
#define nodeDisplay(x) pprint(x)
extern void print(const void *obj);
@@ -28,6 +27,7 @@ extern char *pretty_format_node_dump(const char *dump);
extern void print_rt(const List *rtable);
extern void print_expr(const Node *expr, const List *rtable);
extern void print_pathkeys(const List *pathkeys, const List *rtable);
+extern void print_unique_keys(const List *unique_keys, const List *rtable);
extern void print_tl(const List *tlist, const List *rtable);
extern void print_slot(TupleTableSlot *slot);
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 182ffeef4b..374cac157b 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -44,6 +44,7 @@ extern IndexPath *create_index_path(PlannerInfo *root,
List *indexorderbys,
List *indexorderbycols,
List *pathkeys,
+ List *uniquekeys,
ScanDirection indexscandir,
bool indexonly,
Relids required_outer,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137d1d..10b6d2a8c7 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -235,4 +235,12 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *live_childrels);
+/*
+ * uniquekey.c
+ * Utilities for matching and building unique keys
+ */
+extern List *build_group_uniquekeys(PlannerInfo *root);
+extern List *build_distinct_uniquekeys(PlannerInfo *root);
+extern bool uniquekeys_contained_in(List *keys1, List *keys2);
+
#endif /* PATHS_H */
--
2.21.0
On Thu, Jul 11, 2019 at 2:40 AM Floris Van Nee <florisvannee@optiver.com> wrote:
I verified that the backwards index scan is indeed functioning now. However,
I'm afraid it's not that simple, as I think the cursor case is broken now. I
think having just the 'scan direction' in the btree code is not enough to get
this working properly, because we need to know whether we want the minimum or
maximum element of a certain prefix. There are basically four cases:

- Forward Index Scan + Forward cursor: we want the minimum element within a prefix and we want to skip 'forward' to the next prefix
- Forward Index Scan + Backward cursor: we want the minimum element within a prefix and we want to skip 'backward' to the previous prefix
- Backward Index Scan + Forward cursor: we want the maximum element within a prefix and we want to skip 'backward' to the previous prefix
- Backward Index Scan + Backward cursor: we want the maximum element within a prefix and we want to skip 'forward' to the next prefix

These cases make it rather complicated unfortunately. They do somewhat tie in
with the previous discussion on this thread about being able to skip to the
min or max value. If we ever want to support a sort of minmax scan, we'll
encounter the same issues.
Yes, these four cases are indeed a very good point. I've prepared a new version
of the patch, where they, plus an index condition and the handling of situations
when it eliminates one or more unique elements, are addressed. It seems to fix
the issues and also works for those hypothetical examples you've mentioned
above, but of course it looks pretty complicated and I need to polish it a bit
before posting.
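The four combinations above can be sketched as a toy model over an in-memory sorted array (illustrative C only, not the actual nbtree code): a backward index scan wants the maximum element per prefix, a forward scan the minimum, and a backward-moving cursor reverses the order in which prefixes are emitted.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { int a, b; } Tup;

/* Toy model of the four cases: over tuples sorted on (a,b), a backward
 * index scan wants the maximum b per prefix a, a forward scan the
 * minimum; a backward-moving cursor then reverses the order in which
 * the prefixes are emitted. */
static int
skip_scan(const Tup *t, int n, bool backward_scan, bool backward_fetch,
          Tup *out)
{
    int nout = 0;

    for (int i = 0; i < n;)
    {
        int j = i;

        while (j + 1 < n && t[j + 1].a == t[i].a)
            j++;                /* j = last tuple of this prefix */
        out[nout++] = backward_scan ? t[j] : t[i];
        i = j + 1;              /* "skip" to the next prefix */
    }

    /* prefixes come out in descending order exactly when the scan and
     * cursor directions differ */
    if (backward_scan != backward_fetch)
    {
        for (int k = 0; k < nout / 2; k++)
        {
            Tup tmp = out[k];

            out[k] = out[nout - 1 - k];
            out[nout - 1 - k] = tmp;
        }
    }
    return nout;
}
```

For example, over (1,1),(1,2),(2,1),(2,3),(3,1),(3,2) a forward scan fetched forward yields (1,1),(2,1),(3,1), while a backward scan fetched forward yields (3,2),(2,3),(1,2).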
On Thu, Jul 11, 2019 at 12:13 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
On Thu, 11 Jul 2019 at 19:41, Floris Van Nee <florisvannee@optiver.com> wrote:
SELECT DISTINCT ON (a) a,b,c FROM a WHERE c = 2 (with an index on a,b,c)
Data (imagine every tuple here actually occurs 10,000 times in the index to
see the benefit of skipping):
1,1,1
1,1,2
1,2,2
1,2,3
2,2,1
2,2,3
3,1,1
3,1,2
3,2,2
3,2,3

Creating a cursor on this query and then moving forward, you should get
(1,1,2), (3,1,2). In the current implementation of the patch, after
bt_first, it skips over (1,1,2) to (2,2,1). It checks quals and moves
forward one-by-one until it finds a match. This match only comes at (3,1,2)
however. Then it skips to the end.

If you move the cursor backwards from the end of the cursor, you should
still get (3,1,2) (1,1,2). A possible implementation would start at the end
and do a skip to the beginning of the prefix: (3,1,1). Then it needs to
move forward one-by-one in order to find the first matching (minimum) item
(3,1,2). When it finds it, it needs to skip backwards to the beginning of
prefix 2 (2,2,1). It needs to move forwards to find the minimum element,
but should stop as soon as it detects that the prefix doesn't match anymore
(because there is no match for prefix 2, it will move all the way from
(2,2,1) to (3,1,1)). It then needs to skip backwards again to the start of
prefix 1: (1,1,1) and scan forward to find (1,1,2).
Perhaps anyone can think of an easier way to implement it?

One option is to just not implement it, and instead change
ExecSupportsBackwardScan() so that it returns false for skip index
scans, or perhaps at least implement an index am method to allow the
planner to be able to determine if the index implementation supports
it... amcanskipbackward
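Floris's walkthrough above can be reduced to a small executable model (illustrative C, not PostgreSQL internals): for each distinct prefix, the scan must return the first tuple satisfying the qual, if any. The model below does this naively, tuple by tuple, which is exactly the per-prefix answer a correct skip scan implementation has to reproduce in both directions.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { int a, b, c; } Tup;

/* Toy model of "SELECT DISTINCT ON (a) a,b,c ... WHERE c = <qual_c>"
 * over an index sorted on (a,b,c): for each distinct prefix a, emit
 * the first tuple that satisfies the qual, if any.  A real skip scan
 * jumps between prefixes instead of walking every tuple, but the
 * visible result must match this. */
static int
distinct_on_with_qual(const Tup *t, int n, int qual_c, Tup *out)
{
    int nout = 0;
    int i = 0;

    while (i < n)
    {
        int prefix = t[i].a;
        bool emitted = false;

        for (; i < n && t[i].a == prefix; i++)
        {
            if (!emitted && t[i].c == qual_c)
            {
                out[nout++] = t[i];
                emitted = true;
            }
        }
    }
    return nout;
}
```

With the data above and c = 2, this yields exactly (1,1,2) and (3,1,2): prefix 2 has no qualifying tuple and is skipped entirely.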
Yep, it was discussed a few times in this thread, and after we've discovered
(thanks to Floris) so many issues I was also one step away from implementing
this idea. But at the same time, as Thomas correctly noticed, our implementation
needs to be extensible to handle future use cases, and this particular cursor
juggling seems already like a pretty good example of such "future use case". So
I hope by dealing with it we can also figure out what needs to be extensible.
On Tue, Jul 16, 2019 at 6:53 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
Here is a patch more in that direction.
Some questions:
1) Do we really need the UniqueKey struct ? If it only contains the
EquivalenceClass field then we could just have a list of those instead.
That would make the patch simpler.

2) Do we need both canon_uniquekeys and query_uniquekeys ? Currently
the patch only uses canon_uniquekeys because we attach the list
directly on the Path node.

I'll clean the patch up based on your feedback, and then start to rebase
v21 on it.
Thanks! I'll also take a look as soon as I'm finished with the last updates.
On Wed, 17 Jul 2019 at 04:53, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
Here is a patch more in that direction.
Thanks. I've just had a look over this and it's roughly what I have in mind.
Here are the comments I noted down during the review:
cost_index:
I know you've not finished here, but I think it'll need to adjust
tuples_fetched somehow to account for estimate_num_groups() on the
Path's unique keys. Any Eclass with an ec_has_const = true does not
need to be part of the estimate there as there can only be at most one
value for these.
For example, in a query such as:
SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;
you only need to perform estimate_num_groups() on "y".
I'm really not quite sure on what exactly will be required from
amcostestimate() here. The cost of the skip scan is not the same as
the normal scan. So either that API needs adjusting to allow the caller
to mention that we want skip scans estimated, or there needs to be
another callback.
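The ec_has_const point can be illustrated with a toy estimator (a sketch, not the actual costsize.c API): keys pinned to a constant by the quals contribute a factor of one, so only the remaining keys feed the group-count estimate.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct
{
    bool has_const;     /* corresponds to EquivalenceClass.ec_has_const */
    double ndistinct;   /* per-column distinct-value estimate */
} KeyEst;

/* Toy group-count estimate: a column fixed to a constant (e.g. "x = 1")
 * can only produce one group, so it drops out of the product.  The real
 * code would call estimate_num_groups() on just the remaining
 * expressions, which also accounts for correlation. */
static double
estimate_groups(const KeyEst *keys, int nkeys)
{
    double groups = 1.0;

    for (int i = 0; i < nkeys; i++)
        if (!keys[i].has_const)
            groups *= keys[i].ndistinct;
    return groups;
}
```

For "WHERE x = 1 GROUP BY x, y" with 100 distinct x and 7 distinct y, only y's 7 groups remain in the estimate.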
build_index_paths:
I don't quite see where you're checking if the query's unique_keys
match what unique keys can be produced by the index. This is done for
pathkeys with:
pathkeys_possibly_useful = (scantype != ST_BITMAPSCAN &&
!found_lower_saop_clause &&
has_useful_pathkeys(root, rel));
index_is_ordered = (index->sortopfamily != NULL);
if (index_is_ordered && pathkeys_possibly_useful)
{
index_pathkeys = build_index_pathkeys(root, index,
ForwardScanDirection);
useful_pathkeys = truncate_useless_pathkeys(root, rel,
index_pathkeys);
orderbyclauses = NIL;
orderbyclausecols = NIL;
}
Here has_useful_pathkeys() checks if the query requires some ordering.
For unique keys you'll want to do the same. You'll have set the
query's unique key requirements in standard_qp_callback().
I think basically build_index_paths() should be building index paths
with unique keys, for all indexes that can support the query's unique
keys. I'm just a bit uncertain if we need to create both a normal
index path and another path for the same index with unique keys.
Perhaps we can figure that out down the line somewhere. Maybe it's
best to build path types for now, when possible, and we can consider
later if we can skip the non-uniquekey paths. Likely that would
require a big XXX comment to explain we need to review that before the
code makes it into core.
get_uniquekeys_for_index:
I think this needs to follow the lead of build_index_pathkeys more closely.
Basically, ask the index what its pathkeys are.
standard_qp_callback:
build_group_uniquekeys & build_distinct_uniquekeys could likely be one
function that takes a list of SortGroupClause. You just either pass
the groupClause or distinctClause in. Pretty much the UniqueKey
version of make_pathkeys_for_sortclauses().
Some questions:
1) Do we really need the UniqueKey struct ? If it only contains the
EquivalenceClass field then we could just have a list of those instead.
That would make the patch simpler.
I dunno about that. I understand it looks a bit pointless due to just
having one field, but perhaps we can worry about that later. If we
choose to ditch it and replace it with just an EquivalenceClass then
we can do that later.
2) Do we need both canon_uniquekeys and query_uniquekeys ? Currently
the patch only uses canon_uniquekeys because we attach the list
directly on the Path node.
canon_uniquekeys should store at most one UniqueKey per
EquivalenceClass. The reason for this is for fast comparison. We can
compare memory addresses rather than checking individual fields are
equal. Now... yeah it's true that there is only one field so far and
we could just check the pointers are equal on the EquivalenceClasses,
but I think maybe this is in the same boat as #1. Let's do it for now
so we're sticking as close as we can to the guidelines laid out by PathKeys, and
once it's all working and plugged into skip scans then we can decide
if it needs a simplification pass over the code.
I'll clean the patch up based on your feedback, and then start to rebase
v21 on it.
Cool.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 7/22/19 1:44 AM, David Rowley wrote:
Here are the comments I noted down during the review:
cost_index:
I know you've not finished here, but I think it'll need to adjust
tuples_fetched somehow to account for estimate_num_groups() on the
Path's unique keys. Any Eclass with an ec_has_const = true does not
need to be part of the estimate there as there can only be at most one
value for these.

For example, in a query such as:
SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;
you only need to perform estimate_num_groups() on "y".
I'm really not quite sure on what exactly will be required from
amcostestimate() here. The cost of the skip scan is not the same as
the normal scan. So either that API needs adjusting to allow the caller
to mention that we want skip scans estimated, or there needs to be
another callback.
I think this part will become more clear once the index skip scan patch
is rebased, as we got the uniquekeys field on the Path, and the
indexskipprefixy info on the IndexPath node.
build_index_paths:
I don't quite see where you're checking if the query's unique_keys
match what unique keys can be produced by the index. This is done for
pathkeys with:

pathkeys_possibly_useful = (scantype != ST_BITMAPSCAN &&
!found_lower_saop_clause &&
has_useful_pathkeys(root, rel));
index_is_ordered = (index->sortopfamily != NULL);
if (index_is_ordered && pathkeys_possibly_useful)
{
index_pathkeys = build_index_pathkeys(root, index,
ForwardScanDirection);
useful_pathkeys = truncate_useless_pathkeys(root, rel,
index_pathkeys);
orderbyclauses = NIL;
orderbyclausecols = NIL;
}

Here has_useful_pathkeys() checks if the query requires some ordering.
For unique keys you'll want to do the same. You'll have set the
query's unique key requirements in standard_qp_callback().

I think basically build_index_paths() should be building index paths
with unique keys, for all indexes that can support the query's unique
keys. I'm just a bit uncertain if we need to create both a normal
index path and another path for the same index with unique keys.
Perhaps we can figure that out down the line somewhere. Maybe it's
best to build path types for now, when possible, and we can consider
later if we can skip the non-uniquekey paths. Likely that would
require a big XXX comment to explain we need to review that before the
code makes it into core.

get_uniquekeys_for_index:
I think this needs to follow the lead of build_index_pathkeys more closely.
Basically, ask the index what its pathkeys are.

standard_qp_callback:
build_group_uniquekeys & build_distinct_uniquekeys could likely be one
function that takes a list of SortGroupClause. You just either pass
the groupClause or distinctClause in. Pretty much the UniqueKey
version of make_pathkeys_for_sortclauses().
Yeah, I'll move this part of the index skip scan patch to the unique key
patch.
Thanks for your feedback !
Best regards,
Jesper
On Mon, Jul 22, 2019 at 7:10 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
On 7/22/19 1:44 AM, David Rowley wrote:
Here are the comments I noted down during the review:
cost_index:
I know you've not finished here, but I think it'll need to adjust
tuples_fetched somehow to account for estimate_num_groups() on the
Path's unique keys. Any Eclass with an ec_has_const = true does not
need to be part of the estimate there as there can only be at most one
value for these.

For example, in a query such as:
SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;
you only need to perform estimate_num_groups() on "y".
I'm really not quite sure on what exactly will be required from
amcostestimate() here. The cost of the skip scan is not the same as
the normal scan. So either that API needs adjusting to allow the caller
to mention that we want skip scans estimated, or there needs to be
another callback.

I think this part will become more clear once the index skip scan patch
is rebased, as we got the uniquekeys field on the Path, and the
indexskipprefixy info on the IndexPath node.
Here is what I came up with to address the problems mentioned above in this
thread. It passes tests, but I haven't tested it more thoroughly yet (e.g. it
occurred to me that `_bt_read_closest` probably wouldn't work if the next key
that passes an index condition is a few pages away - I'll try to tackle that
soon). Just another small step forward, but I hope it's enough to rebase the
planner changes on top of it.
Also I've added a few tags, mostly to acknowledge the reviewers' contributions.
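The "current page first, then from the root" strategy the patch takes (and the concern above about the next matching key being a few pages away) can be modeled with a toy search; the names here are illustrative, not the real _bt_* routines:

```c
#include <assert.h>

/* Toy model of the patch's skip strategy: to find the first value
 * greater than vals[pos], first scan the remainder of the current
 * "page" [pos+1 .. page_end]; only if the next distinct value is not
 * on this page, fall back to a descent "from the root", modeled here
 * as a binary search over the rest of the sorted array.  Returns n
 * when no greater value exists. */
static int
skip_to_next(const int *vals, int n, int pos, int page_end)
{
    /* cheap path: look within the current page first */
    for (int i = pos + 1; i <= page_end && i < n; i++)
        if (vals[i] > vals[pos])
            return i;

    /* fall back: binary search the remaining pages for first > vals[pos] */
    int lo = page_end + 1, hi = n;

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (vals[mid] > vals[pos])
            hi = mid;
        else
            lo = mid + 1;
    }
    return lo;
}
```

When there are few distinct values, the cheap path rarely hits and most skips descend from the root; when there are many, the next value is usually on the same page, which is the trade-off the commit message describes.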
Attachments:
v22-0001-Index-skip-scan.patch (application/octet-stream)
From 454013e6c9bc87a9e7686501b09a75f3ab2dffd7 Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Wed, 3 Jul 2019 16:25:20 +0200
Subject: [PATCH v22] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan and IndexScan. To make it suitable both for
situations with a small number of distinct values and for those with a
significant number of distinct values, the following approach is taken:
instead of searching from the root for every value, we search first on the
current page, and then, if the value is not found there, continue searching
from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
Author: Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Floris Van Nee
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 15 +
doc/src/sgml/indexam.sgml | 63 ++
doc/src/sgml/indices.sgml | 24 +
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 18 +
src/backend/access/nbtree/nbtree.c | 13 +
src/backend/access/nbtree/nbtsearch.c | 652 +++++++++++++++++-
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 29 +
src/backend/executor/nodeIndexonlyscan.c | 43 +-
src/backend/executor/nodeIndexscan.c | 43 +-
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 3 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/path/pathkeys.c | 84 ++-
src/backend/optimizer/plan/createplan.c | 20 +-
src/backend/optimizer/plan/planagg.c | 1 +
src/backend/optimizer/plan/planner.c | 79 ++-
src/backend/optimizer/util/pathnode.c | 40 ++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 8 +
src/include/access/genam.h | 2 +
src/include/access/nbtree.h | 7 +
src/include/nodes/execnodes.h | 6 +
src/include/nodes/pathnodes.h | 10 +
src/include/nodes/plannodes.h | 2 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/include/optimizer/paths.h | 4 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 434 ++++++++++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 163 +++++
41 files changed, 1773 insertions(+), 24 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index ee3bd56274..a88b730f2e 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..9644b9f8cb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4400,6 +4400,21 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). The default is
+ <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index dd54c68802..73b1b4fcf7 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -144,6 +144,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -687,6 +688,68 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan,
+ ScanDirection direction,
+ ScanDirection indexdir,
+ bool scanstart,
+ int prefix);
+</programlisting>
+ Skip past all tuples where the first <parameter>prefix</parameter> columns
+ have the same value as the last tuple returned in the current scan. The
+ arguments are:
+
+ <variablelist>
+ <varlistentry>
+ <term><parameter>scan</parameter></term>
+ <listitem>
+ <para>
+ Index scan information
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>direction</parameter></term>
+ <listitem>
+ <para>
+ The direction in which data is advancing.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>indexdir</parameter></term>
+ <listitem>
+ <para>
+ The direction in which the index must be read.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>scanstart</parameter></term>
+ <listitem>
+ <para>
+ Whether this is the start of the scan.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>prefix</parameter></term>
+ <listitem>
+ <para>
+ The size of the distinct prefix, in index key columns.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 95c0a1926c..567141046f 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1235,6 +1235,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ An index scan used to retrieve the distinct values of a column can be
+ inefficient, since it has to step over all the equal values of a key.
+ In such cases the planner will consider an index skip scan, which is
+ based on the idea of a <firstterm>Loose index scan</firstterm>. Rather
+ than scanning all equal values of a key, as soon as a new value is
+ found, it searches for a larger value on the same index page, and if
+ none is found there, restarts the search by descending from the root.
+ This is much faster when the index has many equal keys.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ae7b729edd..233ea9e5ec 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -109,6 +109,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cf9699ad18..9817f34c34 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 470b121e7d..328c17f13a 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 5cc30dac42..019e330cff 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -82,6 +82,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index aefdd2916d..6d017a2337 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -730,6 +731,23 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool scanstart, int prefix)
+{
+ SCAN_CHECKS;
+ CHECK_SCAN_PROCEDURE(amskip);
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction,
+ indexdir, scanstart, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 85e54ac44b..b68f096b3a 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -131,6 +131,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -380,6 +381,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -447,6 +450,16 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool start, int prefix)
+{
+ return _bt_skip(scan, direction, indexdir, start, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c655dadb96..81c5a37def 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -28,6 +28,8 @@ static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
+static bool _bt_read_closest(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
@@ -37,7 +39,10 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
+static inline void _bt_update_skip_scankeys(IndexScanDesc scan,
+ Relation indexRel);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir);
/*
* _bt_drop_lock_and_maybe_pin()
@@ -1380,6 +1385,315 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple.
+ *
+ * The current position is set so that a subsequent call to _bt_next will
+ * fetch the first tuple that differs in the leading 'prefix' keys.
+ *
+ * There are four different kinds of skipping (depending on dir and
+ * indexdir) that are important to distinguish, especially in the presence
+ * of an index condition:
+ *
+ * * Advancing forward and reading forward
+ * simple scan
+ *
+ * * Advancing forward and reading backward
+ * scan inside a cursor fetching backward, when skipping is necessary
+ * right from the start
+ *
+ * * Advancing backward and reading forward
+ * scan with order by desc inside a cursor fetching forward, when
+ * skipping is necessary right from the start
+ *
+ * * Advancing backward and reading backward
+ * simple scan with order by desc
+ *
+ * This function in conjunction with _bt_read_closest handles them all.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool scanstart, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Relation indexRel = scan->indexRelation;
+ OffsetNumber startOffset = ItemPointerGetOffsetNumber(&scan->xs_itup->t_tid);
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ so->skipScanKey->keysz = prefix;
+ so->skipScanKey->scantid = NULL;
+ }
+ else
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos) &&
+ _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {
+ bool keyFound = false;
+
+ LockBuffer(so->currPos.buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, so->currPos.buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(so->currPos.buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_read_closest(scan, dir, indexdir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found the scan key within the current page, so let's scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+ * number
+ */
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ /*
+ * Simplest case, advance forward and read also forward. At this moment we
+ * are at the next distinct key at the beginning of the series. Go back one
+ * step and let _bt_read_closest figure out about index condition.
+ */
+ if (ScanDirectionIsForward(dir) && ScanDirectionIsForward(indexdir))
+ offnum = OffsetNumberPrev(offnum);
+
+ /*
+ * Advance backward but read forward. At this moment we are at the next
+ * distinct key at the beginning of the series. If the scan has just
+ * started, we can read forward without doing anything else. Otherwise find
+ * the previous distinct key and the beginning of its series and read
+ * forward from there. To do so, go back one step, perform a binary search
+ * to find the first item in the series, and let _bt_read_closest do
+ * everything else.
+ */
+ else if (ScanDirectionIsBackward(dir) && ScanDirectionIsForward(indexdir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ if (!scanstart)
+ {
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ /* One step back to find a previous value */
+ _bt_read_closest(scan, dir, dir, offnum);
+
+ if (_bt_next(scan, dir))
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ /*
+ * And now find the last item of the sequence for the
+ * current value, with the intention of doing
+ * OffsetNumberNext. As a result we end up on the first
+ * element of the sequence.
+ */
+ if (_bt_scankey_within_page(scan, so->skipScanKey,
+ so->currPos.buf, dir))
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ else
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ }
+ }
+ else
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ }
+
+ /*
+ * Advance forward but read backward. At this moment we are at the next
+ * distinct key at the beginning of the series. If the scan has just
+ * started, we can go one step back and read forward without doing
+ * anything else. Otherwise find the next distinct key and the beginning
+ * of its series, go one step back and read backward from there.
+ *
+ * An interesting situation can arise if one of the distinct keys does not
+ * pass the corresponding index condition at all. In this case reading
+ * backward can lead to the previous distinct key being found, creating a
+ * loop. To avoid that, check the value about to be returned, and jump one
+ * more time if it is the same as at the beginning.
+ */
+ else if (ScanDirectionIsForward(dir) && ScanDirectionIsBackward(indexdir))
+ {
+ if (scanstart)
+ offnum = OffsetNumberPrev(offnum);
+ else
+ {
+ OffsetNumber nextOffset = startOffset;
+
+ while(nextOffset == startOffset)
+ {
+ /*
+ * Find a next index tuple to update scan key. It could be at
+ * the end, so check for max offset
+ */
+ OffsetNumber curOffnum = offnum;
+ Page page = BufferGetPage(so->currPos.buf);
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ ItemId itemid = PageGetItemId(page, Min(offnum, maxoff));
+
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ scan->xs_itup = (IndexTuple) PageGetItem(page, itemid);
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /*
+ * The jump to the next key returned the same offset, which
+ * means we are at the end and need to return
+ */
+ if (offnum == curOffnum)
+ {
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ BTScanPosUnpinIfPinned(so->currPos);
+ BTScanPosInvalidate(so->currPos);
+
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+
+ /* Check whether _bt_read_closest returned the already found item */
+ if (_bt_read_closest(scan, dir, indexdir, offnum))
+ {
+ IndexTuple itup;
+
+ currItem = &so->currPos.items[so->currPos.lastItem];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
+ }
+ else
+ {
+ elog(ERROR, "could not read closest index tuples: %d", offnum);
+ }
+
+ /*
+ * If nextOffset is the same as before, it means we are in a
+ * loop; move offnum back to the original position and jump
+ * further
+ */
+ if (nextOffset == startOffset)
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+ }
+
+ /* Now read the data */
+ if (!_bt_read_closest(scan, dir, indexdir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
@@ -1596,6 +1910,293 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
return (so->currPos.firstItem <= so->currPos.lastItem);
}
+/*
+ * _bt_read_closest() -- Load data from the two closest items, previous and
+ * current, on the current index page into
+ * so->currPos. The previous item may not pass the
+ * index condition, but it is needed for skip scan.
+ *
+ * Similar to _bt_readpage, except that it reads only the current and the
+ * previous item. So far it is used only by _bt_skip.
+ *
+ * Returns true if the two required matching items were found on the page,
+ * false otherwise.
+ */
+static bool
+_bt_read_closest(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, OffsetNumber offnum)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ IndexTuple prevItup = NULL;
+ int itemIndex;
+ bool continuescan;
+ int indnatts;
+
+ /*
+ * We must have the buffer pinned and locked, but the usual macro can't be
+ * used here; this function is what makes it good for currPos.
+ */
+ Assert(BufferIsValid(so->currPos.buf));
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ /* allow next page be processed by parallel worker */
+ if (scan->parallel_scan)
+ {
+ if (ScanDirectionIsForward(dir))
+ _bt_parallel_release(scan, opaque->btpo_next);
+ else
+ _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ }
+
+ continuescan = true; /* default assumption */
+ indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * We note the buffer's block number so that we can release the pin later.
+ * This allows us to re-read the buffer if it is needed again for hinting.
+ */
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * safe to apply LP_DEAD hints to the page later. This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+
+ /*
+ * we must save the page's right-link while scanning it; this tells us
+ * where to step right to after we're done with these items. There is no
+ * corresponding need for the left-link, since splits always go right.
+ */
+ so->currPos.nextPage = opaque->btpo_next;
+
+ /* initialize tuple workspace to empty */
+ so->currPos.nextTupleOffset = 0;
+
+ /*
+ * Now that the current page has been made consistent, the macro should be
+ * good.
+ */
+ Assert(BTScanPosIsPinned(so->currPos));
+
+ if (ScanDirectionIsForward(indexdir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ offnum = Max(offnum, minoff);
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ offnum = OffsetNumberNext(offnum);
+ continue;
+ }
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
+ {
+ /* tuple passes all scan key conditions, so remember it */
+ if (ScanDirectionIsBackward(dir))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else if (prevItup != NULL)
+ {
+ /*
+ * Save the current item and the previous, even if the
+ * latter does not pass scan key conditions
+ */
+ ItemPointerData tid = prevItup->t_tid;
+ OffsetNumber prevOffnum = ItemPointerGetOffsetNumber(&tid);
+
+ _bt_saveitem(so, itemIndex, prevOffnum, prevItup);
+ itemIndex++;
+
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+
+ if (itemIndex == 2)
+ {
+ Assert(itemIndex <= MaxIndexTuplesPerPage);
+ so->currPos.firstItem = 0;
+ /*
+ * The actual itemIndex depends on the direction in which we
+ * advance, if it is different from indexdir
+ */
+ so->currPos.itemIndex = ScanDirectionIsForward(dir) ? 0 : 1;
+ so->currPos.lastItem = 1;
+
+ /*
+ * All of the closest items were found, so we can report
+ * success
+ */
+ return true;
+ }
+ }
+ /* When !continuescan, there can't be any more matches, so stop */
+ if (!continuescan)
+ break;
+
+ prevItup = itup;
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /*
+ * We don't need to visit page to the right when the high key
+ * indicates that no more matches will be found there.
+ *
+ * Checking the high key like this works out more often than you might
+ * think. Leaf page splits pick a split point between the two most
+ * dissimilar tuples (this is weighed against the need to evenly share
+ * free space). Leaf pages with high key attribute values that can
+ * only appear on non-pivot tuples on the right sibling page are
+ * common.
+ */
+ if (continuescan && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
+ int truncatt;
+
+ truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
+ _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+ }
+
+ if (!continuescan)
+ so->currPos.moreRight = false;
+
+ Assert(itemIndex <= MaxIndexTuplesPerPage);
+ so->currPos.firstItem = 0;
+ so->currPos.lastItem = itemIndex - 1;
+ so->currPos.itemIndex = 0;
+ }
+ else
+ {
+ /* load items[] in descending order */
+ itemIndex = MaxIndexTuplesPerPage;
+
+ offnum = Min(offnum, maxoff);
+
+ while (offnum >= minoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+ bool tuple_alive;
+ bool passes_quals;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual. Most of the
+ * time, it's a win to not bother examining the tuple's index
+ * keys, but just skip to the next tuple (previous, actually,
+ * since we're scanning backwards). However, if this is the first
+ * tuple on the page, we do check the index keys, to prevent
+ * uselessly advancing to the page to the left. This is similar
+ * to the high key optimization used by forward scans.
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ Assert(offnum >= P_FIRSTDATAKEY(opaque));
+ if (offnum > P_FIRSTDATAKEY(opaque))
+ {
+ offnum = OffsetNumberPrev(offnum);
+ continue;
+ }
+
+ tuple_alive = false;
+ }
+ else
+ tuple_alive = true;
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
+ &continuescan);
+ if (passes_quals && tuple_alive)
+ {
+ /* tuple passes all scan key conditions, so remember it */
+ if (ScanDirectionIsForward(dir))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else if (prevItup != NULL)
+ {
+ /*
+ * Save the current item and the previous, even if the
+ * latter does not pass scan key conditions
+ */
+ ItemPointerData tid = prevItup->t_tid;
+ OffsetNumber prevOffnum = ItemPointerGetOffsetNumber(&tid);
+
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, prevOffnum, prevItup);
+
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+
+ if (MaxIndexTuplesPerPage - itemIndex == 2)
+ {
+ Assert(itemIndex <= MaxIndexTuplesPerPage);
+ so->currPos.firstItem = MaxIndexTuplesPerPage - 2;
+ /*
+ * The actual itemIndex depends on the direction in which we
+ * advance, if it is different from indexdir
+ */
+ so->currPos.itemIndex = MaxIndexTuplesPerPage -
+ (ScanDirectionIsForward(dir) ? 2 : 1);
+ so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+
+ /*
+ * All of the closest items were found, so we can report
+ * success
+ */
+ return true;
+ }
+ }
+ if (!continuescan)
+ {
+ /* there can't be any more matches, so stop */
+ so->currPos.moreLeft = false;
+ break;
+ }
+
+ prevItup = itup;
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ Assert(itemIndex >= 0);
+ so->currPos.firstItem = itemIndex;
+ so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ }
+
+ /* Not all of the closest items were found */
+ return false;
+}
+
/* Save an index item into so->currPos.items[itemIndex] */
static void
_bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -2249,3 +2850,52 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
so->numKilled = 0; /* just paranoia */
so->markItemIndex = -1; /* ditto */
}
+
+/*
+ * _bt_update_skip_scankeys() -- set up new values for the existing scankeys
+ * based on the current index tuple
+ */
+static inline void
+_bt_update_skip_scankeys(IndexScanDesc scan, Relation indexRel)
+{
+ TupleDesc itupdesc;
+ int indnkeyatts, i;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey scankeys = so->skipScanKey->scankeys;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+}
+
+/*
+ * _bt_scankey_within_page() -- check if the provided scankey could be found
+ * within the page specified by the buffer.
+ */
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir)
+{
+ OffsetNumber low, high, compare_offset;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ return _bt_compare(scan->indexRelation,
+ key, page, compare_offset) > compare_value;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 45472db147..dc151ecf09 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 92969636b7..8db74287a5 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -130,6 +130,7 @@ static void ExplainDummyGroup(const char *objtype, const char *labelname,
static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
static void ExplainJSONLineEnding(ExplainState *es);
static void ExplainYAMLLineStarting(ExplainState *es);
+static void ExplainIndexSkipScanKeys(ExplainState *es, int skipPrefixSize);
static void escape_yaml(StringInfo buf, const char *str);
@@ -1041,6 +1042,22 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
return planstate_tree_walker(planstate, ExplainPreScanNode, rels_used);
}
+/*
+ * ExplainIndexSkipScanKeys -
+ * Append information about index skip scan to es->str.
+ *
+ * Can be used to print the skip prefix size.
+ */
+static void
+ExplainIndexSkipScanKeys(ExplainState *es, int skipPrefixSize)
+{
+ if (skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL, skipPrefixSize, es);
+ }
+}
+
/*
* ExplainNode -
* Appends a description of a plan tree to es->str
@@ -1363,6 +1380,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexScan *indexscan = (IndexScan *) plan;
+ ExplainIndexSkipScanKeys(es, indexscan->indexskipprefixsize);
+
ExplainIndexScanDetails(indexscan->indexid,
indexscan->indexorderdir,
es);
@@ -1373,6 +1392,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ ExplainIndexSkipScanKeys(es, indexonlyscan->indexskipprefixsize);
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1582,6 +1603,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
switch (nodeTag(plan))
{
case T_IndexScan:
+ if (((IndexScan *) plan)->indexskipprefixsize > 0)
+ {
+ ExplainPropertyBool("Skip scan mode", true, es);
+ }
show_scan_qual(((IndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
if (((IndexScan *) plan)->indexqualorig)
@@ -1595,6 +1620,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->indexskipprefixsize > 0)
+ {
+ ExplainPropertyBool("Skip scan mode", true, es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 8a4d795d1a..e88040eca5 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -65,6 +65,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
IndexScanDesc scandesc;
TupleTableSlot *slot;
ItemPointer tid;
+ IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) node->ss.ps.plan;
/*
* extract necessary information from index scan node
@@ -72,7 +73,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexonlyscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -115,6 +116,42 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_SkipPrefixSize > 0)
+ {
+ bool startscan = false;
+
+ /*
+ * If advancing direction is different from index direction, we must
+ * skip right away, but _bt_skip requires a starting point.
+ */
+ if (direction * indexonlyscan->indexorderdir < 0 &&
+ !node->ioss_FirstTupleEmitted)
+ {
+ if (index_getnext_tid(scandesc, direction))
+ {
+ node->ioss_FirstTupleEmitted = true;
+ startscan = true;
+ }
+ }
+
+ if (node->ioss_FirstTupleEmitted &&
+ !index_skip(scandesc, direction, indexonlyscan->indexorderdir,
+ startscan, node->ioss_SkipPrefixSize))
+ {
+ /* Reached end of index. At this point currPos is invalidated, and
+  * we need to reset ioss_FirstTupleEmitted, since otherwise, after
+  * going backwards, reaching the end of the index, and going forward
+  * again, we would apply the skip again. That would be incorrect and
+  * lead to an extra skipped item. */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -253,6 +290,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -503,6 +542,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index ac7aa81f67..6e649930c2 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -85,6 +85,7 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ IndexScan *indexscan = (IndexScan *) node->ss.ps.plan;
/*
* extract necessary information from index scan node
@@ -92,7 +93,7 @@ IndexNext(IndexScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -116,6 +117,7 @@ IndexNext(IndexScanState *node)
node->iss_NumOrderByKeys);
node->iss_ScanDesc = scandesc;
+ node->iss_ScanDesc->xs_want_itup = true;
/*
* If no run-time keys to calculate or they are ready, go ahead and
@@ -127,6 +129,42 @@ IndexNext(IndexScanState *node)
node->iss_OrderByKeys, node->iss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_SkipPrefixSize > 0)
+ {
+ bool startscan = false;
+
+ /*
+ * If advancing direction is different from index direction, we must
+ * skip right away, but _bt_skip requires a starting point.
+ */
+ if (direction * indexscan->indexorderdir < 0 &&
+ !node->ioss_FirstTupleEmitted)
+ {
+ if (index_getnext_slot(scandesc, direction, slot))
+ {
+ node->ioss_FirstTupleEmitted = true;
+ startscan = true;
+ }
+ }
+
+ if (node->ioss_FirstTupleEmitted &&
+ !index_skip(scandesc, direction, indexscan->indexorderdir,
+ startscan, node->ioss_SkipPrefixSize))
+ {
+ /* Reached end of index. At this point currPos is invalidated, and
+  * we need to reset ioss_FirstTupleEmitted, since otherwise, after
+  * going backwards, reaching the end of the index, and going forward
+  * again, we would apply the skip again. That would be incorrect and
+  * lead to an extra skipped item. */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* ok, now that we have what we need, fetch the next tuple.
*/
@@ -149,6 +187,7 @@ IndexNext(IndexScanState *node)
}
}
+ node->ioss_FirstTupleEmitted = true;
return slot;
}
@@ -906,6 +945,8 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexScan;
+ indexstate->ioss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 78deade89b..8bb0b3eaee 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -490,6 +490,7 @@ _copyIndexScan(const IndexScan *from)
COPY_NODE_FIELD(indexorderbyorig);
COPY_NODE_FIELD(indexorderbyops);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
@@ -515,6 +516,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 8400dd319e..44286a86e8 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -559,6 +559,7 @@ _outIndexScan(StringInfo str, const IndexScan *node)
WRITE_NODE_FIELD(indexorderbyorig);
WRITE_NODE_FIELD(indexorderbyops);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
@@ -573,6 +574,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
@@ -2208,6 +2210,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
WRITE_NODE_FIELD(group_pathkeys);
WRITE_NODE_FIELD(window_pathkeys);
WRITE_NODE_FIELD(distinct_pathkeys);
+ WRITE_NODE_FIELD(uniq_distinct_pathkeys);
WRITE_NODE_FIELD(sort_pathkeys);
WRITE_NODE_FIELD(processed_tlist);
WRITE_NODE_FIELD(minmax_aggs);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 6c2626ee62..45354a0b95 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1787,6 +1787,7 @@ _readIndexScan(void)
READ_NODE_FIELD(indexorderbyorig);
READ_NODE_FIELD(indexorderbyops);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
@@ -1806,6 +1807,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a2a9b1f7be..6e0fe90e5c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 08b5061612..af7d9c4270 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
#include "utils/lsyscache.h"
+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
RelOptInfo *partrel,
@@ -94,6 +95,30 @@ make_canonical_pathkey(PlannerInfo *root,
return pk;
}
+/*
+ * pathkey_is_unique
+ * Helper for pathkey_is_redundant that is responsible for the case
+ * where the new pathkey's equivalence class is the same as that of an
+ * existing member of the pathkey list.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+ EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+ ListCell *lc;
+
+ /* If same EC already used in list, then redundant */
+ foreach(lc, pathkeys)
+ {
+ PathKey *old_pathkey = (PathKey *) lfirst(lc);
+
+ if (new_ec == old_pathkey->pk_eclass)
+ return true;
+ }
+
+ return false;
+}
+
/*
* pathkey_is_redundant
* Is a pathkey redundant with one already in the given list?
@@ -133,22 +158,12 @@ static bool
pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys)
{
EquivalenceClass *new_ec = new_pathkey->pk_eclass;
- ListCell *lc;
/* Check for EC containing a constant --- unconditionally redundant */
if (EC_MUST_BE_REDUNDANT(new_ec))
return true;
- /* If same EC already used in list, then redundant */
- foreach(lc, pathkeys)
- {
- PathKey *old_pathkey = (PathKey *) lfirst(lc);
-
- if (new_ec == old_pathkey->pk_eclass)
- return true;
- }
-
- return false;
+ return pathkey_is_unique(new_pathkey, pathkeys);
}
/*
@@ -1096,6 +1111,53 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
return pathkeys;
}
+/*
+ * make_pathkeys_for_distinctclauses
+ * Generate a pathkeys list for distinct clauses, that represents the sort
+ * order specified by a list of SortGroupClauses. Similar to
+ * make_pathkeys_for_sortclauses, but allows to specify if we need to
+ * check the full redundancy, or just uniqueness.
+ */
+List *
+make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *distinctclauses,
+ List *tlist, bool checkRedundant)
+{
+ List *pathkeys = NIL;
+ ListCell *l;
+
+ foreach(l, distinctclauses)
+ {
+ SortGroupClause *sortcl = (SortGroupClause *) lfirst(l);
+ Expr *sortkey;
+ PathKey *pathkey;
+
+ sortkey = (Expr *) get_sortgroupclause_expr(sortcl, tlist);
+ Assert(OidIsValid(sortcl->sortop));
+ pathkey = make_pathkey_from_sortop(root,
+ sortkey,
+ root->nullable_baserels,
+ sortcl->sortop,
+ sortcl->nulls_first,
+ sortcl->tleSortGroupRef,
+ true);
+
+ /* Canonical form eliminates redundant ordering keys */
+ if (checkRedundant)
+ {
+ if (!pathkey_is_redundant(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ else
+ {
+ if (!pathkey_is_unique(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ }
+ return pathkeys;
+}
+
+
/****************************************************************************
* PATHKEYS AND MERGECLAUSES
****************************************************************************/
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 608d5adfed..e4acdec0e0 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -175,12 +175,14 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
Oid indexid, List *indexqual, List *indexqualorig,
List *indexorderby, List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2903,7 +2905,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -2914,7 +2917,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexorderbys,
indexorderbys,
indexorderbyops,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
copy_generic_path_info(&scan_plan->plan, &best_path->path);
@@ -5150,7 +5154,8 @@ make_indexscan(List *qptlist,
List *indexorderby,
List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexScan *node = makeNode(IndexScan);
Plan *plan = &node->scan.plan;
@@ -5167,6 +5172,7 @@ make_indexscan(List *qptlist,
node->indexorderbyorig = indexorderbyorig;
node->indexorderbyops = indexorderbyops;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
@@ -5179,7 +5185,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5194,6 +5201,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 9381939c82..ed52139839 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -505,6 +505,7 @@ minmax_qp_callback(PlannerInfo *root, void *extra)
root->group_pathkeys = NIL;
root->window_pathkeys = NIL;
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index cb897cc7f4..663be21597 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3615,12 +3615,21 @@ standard_qp_callback(PlannerInfo *root, void *extra)
if (parse->distinctClause &&
grouping_is_sortable(parse->distinctClause))
+ {
+ root->uniq_distinct_pathkeys =
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, false);
root->distinct_pathkeys =
- make_pathkeys_for_sortclauses(root,
- parse->distinctClause,
- tlist);
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, true);
+ }
else
+ {
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
+ }
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
@@ -4807,6 +4816,70 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Consider index skip scan as well */
+ if (enable_indexskipscan &&
+ IsA(path, IndexPath) &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ (path->pathtype == T_IndexOnlyScan ||
+ path->pathtype == T_IndexScan) &&
+ root->distinct_pathkeys != NIL)
+ {
+ ListCell *lc;
+ IndexOptInfo *index = NULL;
+ bool different_columns_order = false,
+ not_empty_qual = false;
+ int i = 0;
+
+ index = ((IndexPath *) path)->indexinfo;
+
+ /*
+ * The order of columns in the index must match that of the
+ * unique distinct pathkeys; otherwise we cannot use _bt_search
+ * in the skip implementation, since that could lead to missing
+ * records.
+ */
+ foreach(lc, root->uniq_distinct_pathkeys)
+ {
+ PathKey *pathKey = lfirst_node(PathKey, lc);
+ EquivalenceMember *em =
+ lfirst_node(EquivalenceMember,
+ list_head(pathKey->pk_eclass->ec_members));
+ Var *var = (Var *) em->em_expr;
+
+ Assert(i < index->ncolumns);
+
+ if (index->indexkeys[i] != var->varattno)
+ {
+ different_columns_order = true;
+ break;
+ }
+
+ i++;
+ }
+
+ /*
+ * XXX: In case of an index scan, quals evaluation happens after
+ * ExecScanFetch, which means skip results could be filtered out
+ */
+ if (path->pathtype == T_IndexScan &&
+ parse->jointree != NULL &&
+ parse->jointree->quals != NULL &&
+ ((List *)parse->jointree->quals)->length != 0)
+ not_empty_qual = true;
+
+ if (!different_columns_order && !not_empty_qual)
+ {
+ int distinctPrefixKeys =
+ list_length(root->uniq_distinct_pathkeys);
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ distinctPrefixKeys,
+ numDistinctRows));
+ }
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d884d2bb00..df9b57215f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2928,6 +2928,46 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *basepath,
+ int distinctPrefixKeys,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(distinctPrefixKeys > 0);
+ /*Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));*/
+ pathnode->indexskipprefix = distinctPrefixKeys;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same
+ * as the cost of finding the first key, so the total cost is that
+ * startup cost times the number of distinct values we expect to find.
+ */
+ pathnode->path.startup_cost = basepath->startup_cost;
+ pathnode->path.total_cost = basepath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 40f497660d..8c05b3bb5c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -269,6 +269,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1208eb9a68..007c8ac14e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -912,6 +912,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5ee5e09ddf..99facc8f50 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6e3db06eed..f84791e358 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -130,6 +130,13 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir,
+ ScanDirection indexdir,
+ bool start,
+ int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -225,6 +232,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 8c053be2ca..e5ec5b07c8 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -173,6 +173,8 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool start, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a3583f225b..95231ae0b7 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -663,6 +663,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTScanInsert skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -777,6 +780,8 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool start, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -801,6 +806,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool start, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 99b9fa414f..df82c5d6dd 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1377,6 +1377,8 @@ typedef struct IndexScanState
ExprContext *iss_RuntimeContext;
Relation iss_RelationDesc;
struct IndexScanDescData *iss_ScanDesc;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
/* These are needed for re-checking ORDER BY expr ordering */
pairingheap *iss_ReorderQueue;
@@ -1406,6 +1408,8 @@ typedef struct IndexScanState
* TableSlot slot for holding tuples fetched from the table
* VMBuffer buffer in use for visibility map testing, if any
* PscanLen size of parallel index-only scan descriptor
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ----------------
*/
typedef struct IndexOnlyScanState
@@ -1424,6 +1428,8 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
TupleTableSlot *ioss_TableSlot;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 441e64eca9..eeff4a2935 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -298,6 +298,11 @@ struct PlannerInfo
List *group_pathkeys; /* groupClause pathkeys, if any */
List *window_pathkeys; /* pathkeys of bottom window, if any */
List *distinct_pathkeys; /* distinctClause pathkeys, if any */
+ List *uniq_distinct_pathkeys; /* unique, but potentially redundant
+ distinctClause pathkeys, if any.
+ Used for index skip scan, since
+ redundant distinctClauses also must
+ be considered */
List *sort_pathkeys; /* sortClause pathkeys, if any */
List *part_schemes; /* Canonicalised partition schemes used in the
@@ -829,6 +834,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1165,6 +1171,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1177,6 +1186,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 70f8b8e22b..72b4681613 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -405,6 +405,7 @@ typedef struct IndexScan
List *indexorderbyorig; /* the same in original form */
List *indexorderbyops; /* OIDs of sort ops for ORDER BY exprs */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct scans */
} IndexScan;
/* ----------------
@@ -432,6 +433,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9b6bdbc518..ad28c7f54a 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e70d6a3f18..fa461201a7 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -202,6 +202,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137d1d..a782d12a50 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -209,6 +209,10 @@ extern List *build_join_pathkeys(PlannerInfo *root,
extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
List *sortclauses,
List *tlist);
+extern List *make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist,
+ bool checkRedundant);
extern void initialize_mergeclause_eclasses(PlannerInfo *root,
RestrictInfo *restrictinfo);
extern void update_mergeclause_eclasses(PlannerInfo *root,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 5305b53cac..056de928fe 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..09231fdb88 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,437 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+SELECT DISTINCT four FROM tenk1 ORDER BY four DESC;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, hundred, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) hundred
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+ a | b
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+FETCH FROM c;
+ a | b
+---+---
+ 1 | 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+END;
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+FETCH FROM c;
+ a | b
+---+-------
+ 5 | 10000
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+END;
+DROP TABLE distinct_a;
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+ QUERY PLAN
+--------------------------------------------------------------
+ Index Only Scan using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan mode: true
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 1 | 2
+ 3 | 1 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 1 | 2
+ 1 | 1 | 2
+(2 rows)
+
+END;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Index Only Scan Backward using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan mode: true
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 2 | 2
+ 1 | 2 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 2 | 2
+ 3 | 2 | 2
+(2 rows)
+
+END;
+DROP TABLE distinct_abc;
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ four | ten
+------+-----
+ 0 | 0
+ 1 | 9
+ 2 | 0
+ 3 | 1
+(4 rows)
+
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ four | ten
+------+-----
+ 1 | 9
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ QUERY PLAN
+--------------------------------------
+ Index Scan using tenk1_four on tenk1
+ Skip scan mode: true
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ QUERY PLAN
+---------------------------------------------------
+ Result
+ -> Unique
+ -> Bitmap Heap Scan on tenk1
+ Recheck Cond: (four = 1)
+ -> Bitmap Index Scan on tenk1_four
+ Index Cond: (four = 1)
+(6 rows)
+
+-- check columns order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ four | ten
+------+-----
+ 0 | 0
+ 0 | 2
+ 0 | 4
+ 0 | 6
+ 0 | 8
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Skip scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Skip scan mode: true
+ Index Cond: (four = 0)
+(3 rows)
+
+DROP INDEX tenk1_four_ten;
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ four | ten
+------+-----
+ 0 | 2
+ 2 | 2
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------------
+ Unique
+ -> Index Only Scan using tenk1_ten_four on tenk1
+ Index Cond: (ten = 2)
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------------
+ Unique
+ -> Index Only Scan using tenk1_ten_four on tenk1
+ Index Cond: (ten = 2)
+(3 rows)
+
+DROP INDEX tenk1_ten_four;
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+ four | four
+------+------
+ 0 | 0
+ 2 | 2
+(2 rows)
+
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+ four | ?column?
+------+----------
+ 2 | 1
+ 0 | 1
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+-------------------------------------------
+ Index Only Scan using tenk1_four on tenk1
+ Skip scan mode: true
+(2 rows)
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+FETCH FROM c;
+ four
+------
+ 0
+(1 row)
+
+FETCH BACKWARD FROM c;
+ four
+------
+(0 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+END;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index cd46f071bd..04760639a8 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..22222592ee 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,166 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+SELECT DISTINCT four FROM tenk1 ORDER BY four DESC;
+
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, hundred, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) hundred
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE a;
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+DROP TABLE distinct_a;
+
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+DROP TABLE distinct_abc;
+
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+-- check columns order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+DROP INDEX tenk1_four_ten;
+
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+DROP INDEX tenk1_ten_four;
+
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+END;
--
2.21.0
Hello.
At Wed, 24 Jul 2019 22:49:32 +0200, Dmitry Dolgov <9erthalion6@gmail.com> wrote in <CA+q6zcXgwDMiowOGbr7gimTY3NV-LbcwP=rbma_L56pc+9p1Xw@mail.gmail.com>
On Mon, Jul 22, 2019 at 7:10 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
On 7/22/19 1:44 AM, David Rowley wrote:
Here are the comments I noted down during the review:
cost_index:
I know you've not finished here, but I think it'll need to adjust
tuples_fetched somehow to account for estimate_num_groups() on the
Path's unique keys. Any Eclass with an ec_has_const = true does not
need to be part of the estimate there as there can only be at most one
value for these.

For example, in a query such as:
SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;
you only need to perform estimate_num_groups() on "y".
I'm really not quite sure what exactly will be required from
amcostestimate() here. The cost of a skip scan is not the same as that of
a normal scan. So either that API needs to be adjusted to allow the caller
to say that we want skip scans estimated, or there needs to be
another callback.

I think this part will become clearer once the index skip scan patch
is rebased, as we have the uniquekeys field on the Path, and the
indexskipprefix info on the IndexPath node.

Here is what I came up with to address the problems mentioned above in this
thread. It passes tests, but I haven't tested it more thoroughly yet (e.g. it
occurred to me that `_bt_read_closest` probably wouldn't work if the next key
that passes an index condition is a few pages away - I'll try to tackle that
soon). Just another small step forward, but I hope it's enough to rebase the
planner changes on top of it.

Also I've added a few tags, mostly to credit reviewers' contributions.
I have some comments.
+ * The order of columns in the index should be the same, as for
+ * unique distincs pathkeys, otherwise we cannot use _bt_search
+ * in the skip implementation - this can lead to a missing
+ * records.
It seems that it is enough that distinct pathkeys is contained in
index pathkeys. If it's right, that is almost checked in existing
code:
if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))
It is perfect when needed_pathkeys is distinct_pathkeys. So
additional check is required if and only if it is not the case.
if (enable_indexskipscan &&
IsA(path, IndexPath) &&
((IndexPath *) path)->indexinfo->amcanskip &&
(path->pathtype == T_IndexOnlyScan ||
path->pathtype == T_IndexScan) &&
(needed_pathkeys == root->distinct_pathkeys ||
pathkeys_contained_in(root->distinct_pathkeys,
path->pathkeys)))
path->pathtype is always one of T_IndexOnlyScan or T_IndexScan, so
the check against them isn't needed. If you have concerns about that,
we can add it as an Assert().

I feel uncomfortable looking into indexinfo there. Couldn't we
use indexskipprefix == -1 to signal !amcanskip from
create_index_path?
+ /*
+ * XXX: In case of index scan quals evaluation happens after
+ * ExecScanFetch, which means skip results could be filtered out
+ */
Why can't we use a skip scan path when there is a filter condition? If
something bad happens, the reason should be written here, not just what
we do.
+ if (path->pathtype == T_IndexScan &&
+ parse->jointree != NULL &&
+ parse->jointree->quals != NULL &&
+ ((List *)parse->jointree->quals)->length != 0)
It's better to use list_length instead of peeping inside. It
handles the NULL case as well. (The structure has recently
changed, but .length has not.)
+ * If advancing direction is different from index direction, we must
+ * skip right away, but _bt_skip requires a starting point.
It doesn't seem needed to me. Could you elaborate on the reason
for that?
+ * If advancing direction is different from index direction, we must
+ * skip right away, but _bt_skip requires a starting point.
+ */
+ if (direction * indexonlyscan->indexorderdir < 0 &&
+ !node->ioss_FirstTupleEmitted)
I'm confused by this. "direction" there is the physical scan
direction (fwd/bwd) of index scan, which is already compensated
by indexorderdir. Thus the condition means we do that when
logical ordering (ASC/DESC) is DESC. (Though I'm not sure what
"index direction" exactly means...)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Sorry, there's a too-hard-to-read part.
At Thu, 25 Jul 2019 20:17:37 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in <20190725.201737.192223037.horikyota.ntt@gmail.com>
Hello.
At Wed, 24 Jul 2019 22:49:32 +0200, Dmitry Dolgov <9erthalion6@gmail.com> wrote in <CA+q6zcXgwDMiowOGbr7gimTY3NV-LbcwP=rbma_L56pc+9p1Xw@mail.gmail.com>
On Mon, Jul 22, 2019 at 7:10 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
On 7/22/19 1:44 AM, David Rowley wrote:
Here are the comments I noted down during the review:
cost_index:
I know you've not finished here, but I think it'll need to adjust
tuples_fetched somehow to account for estimate_num_groups() on the
Path's unique keys. Any Eclass with an ec_has_const = true does not
need to be part of the estimate there as there can only be at most one
value for these.

For example, in a query such as:
SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;
you only need to perform estimate_num_groups() on "y".
I'm really not quite sure what exactly will be required from
amcostestimate() here. The cost of a skip scan is not the same as that of
a normal scan. So either that API needs to be adjusted to allow the caller
to say that we want skip scans estimated, or there needs to be
another callback.

I think this part will become clearer once the index skip scan patch
is rebased, as we have the uniquekeys field on the Path, and the
indexskipprefix info on the IndexPath node.

Here is what I came up with to address the problems mentioned above in this
thread. It passes tests, but I haven't tested it more thoroughly yet (e.g. it
occurred to me that `_bt_read_closest` probably wouldn't work if the next key
that passes an index condition is a few pages away - I'll try to tackle that
soon). Just another small step forward, but I hope it's enough to rebase the
planner changes on top of it.

Also I've added a few tags, mostly to credit reviewers' contributions.
I have some comments.
+ * The order of columns in the index should be the same, as for
+ * unique distincs pathkeys, otherwise we cannot use _bt_search
+ * in the skip implementation - this can lead to a missing
+ * records.

It seems that it is enough that distinct pathkeys is contained in
index pathkeys. If it's right, that is almost checked in existing
code:

if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))

It is perfect when needed_pathkeys is distinct_pathkeys. So
additional check is required if and only if it is not the case.
So I think the following change will work.
- + /* Consider index skip scan as well */
- + if (enable_indexskipscan &&
- + IsA(path, IndexPath) &&
- + ((IndexPath *) path)->indexinfo->amcanskip &&
- + (path->pathtype == T_IndexOnlyScan ||
- + path->pathtype == T_IndexScan) &&
- + root->distinct_pathkeys != NIL)
+ + if (enable_indexskipscan &&
+ + IsA(path, IndexPath) &&
+ + ((IndexPath *) path)->indexskipprefix >= 0 &&
+ + (needed_pathkeys == root->distinct_pathkeys ||
+ + pathkeys_contained_in(root->distinct_pathkeys,
+ + path->pathkeys)))
Additional comments on the condition above are:
path->pathtype is always one of T_IndexOnlyScan or T_IndexScan, so
the check against them isn't needed. If you have concerns about that,
we can add it as an Assert().

I feel uncomfortable looking into indexinfo there. Couldn't we
use indexskipprefix == -1 to signal !amcanskip from
create_index_path?

+ /*
+  * XXX: In case of index scan quals evaluation happens after
+  * ExecScanFetch, which means skip results could be filtered out
+  */

Why can't we use a skip scan path when there is a filter condition? If
something bad happens, the reason should be written here, not just what
we do.

+ if (path->pathtype == T_IndexScan &&
+     parse->jointree != NULL &&
+     parse->jointree->quals != NULL &&
+     ((List *) parse->jointree->quals)->length != 0)

It's better to use list_length instead of peeping inside. It
handles the NULL case as well. (The structure has recently
changed, but .length has not.)

+ * If advancing direction is different from index direction, we must
+ * skip right away, but _bt_skip requires a starting point.

It doesn't seem needed to me. Could you elaborate on the reason
for that?

+ * If advancing direction is different from index direction, we must
+ * skip right away, but _bt_skip requires a starting point.
+ */
+ if (direction * indexonlyscan->indexorderdir < 0 &&
+     !node->ioss_FirstTupleEmitted)

I'm confused by this. "direction" there is the physical scan
direction (fwd/bwd) of index scan, which is already compensated
by indexorderdir. Thus the condition means we do that when
logical ordering (ASC/DESC) is DESC. (Though I'm not sure what
"index direction" exactly means...)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Jul 25, 2019 at 1:21 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
I have some comments.
Thank you for the review!
+ * The order of columns in the index should be the same, as for
+ * unique distincs pathkeys, otherwise we cannot use _bt_search
+ * in the skip implementation - this can lead to a missing
+ * records.

It seems that it is enough that distinct pathkeys is contained in
index pathkeys. If it's right, that is almost checked in existing
code:
Looks like you're right. After looking closely, there seems to be an issue in
the original implementation, where we use the wrong prefix_size in such cases.
Without this problem, this condition is indeed enough.
if (enable_indexskipscan &&
    IsA(path, IndexPath) &&
    ((IndexPath *) path)->indexinfo->amcanskip &&
    (path->pathtype == T_IndexOnlyScan ||
     path->pathtype == T_IndexScan) &&
    (needed_pathkeys == root->distinct_pathkeys ||
     pathkeys_contained_in(root->distinct_pathkeys,
                           path->pathkeys)))

path->pathtype is always one of T_IndexOnlyScan or T_IndexScan, so
the check against them isn't needed. If you have concerns about that,
we can add it as an Assert().

+ if (path->pathtype == T_IndexScan &&
+     parse->jointree != NULL &&
+     parse->jointree->quals != NULL &&
+     ((List *) parse->jointree->quals)->length != 0)

It's better to use list_length instead of peeping inside. It
handles the NULL case as well. (The structure has recently
changed, but .length has not.)
Yeah, will change both (hopefully soon)
+ /*
+  * XXX: In case of index scan quals evaluation happens after
+  * ExecScanFetch, which means skip results could be filtered out
+  */

Why can't we use a skip scan path when there is a filter condition? If
something bad happens, the reason should be written here, not just what
we do.
Sorry, looks like I've failed to express this clearly in the
comment. The point is that when an index scan (as opposed to an index
only scan) has some conditions, their evaluation happens after
skipping, and I don't see any way that isn't too invasive to
apply the skip correctly.
+ * If advancing direction is different from index direction, we must
+ * skip right away, but _bt_skip requires a starting point.

It doesn't seem needed to me. Could you elaborate on the reason
for that?

This is needed for e.g. a backward cursor scan without an index condition.
E.g. if we have:
1 1 2 2 3 3
1 2 3 4 5 6
and do
DECLARE c SCROLL CURSOR FOR
SELECT DISTINCT ON (a) a,b FROM ab ORDER BY a, b;
FETCH ALL FROM c;
we should get
1 2 3
1 3 5
When afterwards we do
FETCH BACKWARD ALL FROM c;
we should get
3 2 1
5 2 1
If we use _bt_next the first time without _bt_skip, the first pair would be
3 6 (the first from the end of the tuples, not from the end of the cursor).
+ * If advancing direction is different from index direction, we must
+ * skip right away, but _bt_skip requires a starting point.
+ */
+ if (direction * indexonlyscan->indexorderdir < 0 &&
+     !node->ioss_FirstTupleEmitted)

I'm confused by this. "direction" there is the physical scan
direction (fwd/bwd) of index scan, which is already compensated
by indexorderdir. Thus the condition means we do that when
logical ordering (ASC/DESC) is DESC. (Though I'm not sure what
"index direction" exactly means...)
I'm not sure I follow, what do you mean by compensated? In general you're
right, as David Rowley mentioned above, indexorderdir is a general scan
direction, and direction is flipped estate->es_direction, which is a cursor
direction. The goal of this condition is to catch when those two are different,
and we need to advance and read in different directions.
Hello.
On 2019/07/29 4:17, Dmitry Dolgov wrote:
>> On Thu, Jul 25, 2019 at 1:21 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Yeah, will change both (hopefully soon)
Thanks.
+ /*
+  * XXX: In case of index scan quals evaluation happens after
+  * ExecScanFetch, which means skip results could be filtered out
+  */

Why can't we use a skip scan path when there is a filter condition? If
something bad happens, the reason should be written here, not just what
we do.

Sorry, looks like I've failed to express this clearly in the
comment. The point is that when an index scan (as opposed to an index
only scan) has some conditions, their evaluation happens after
skipping, and I don't see any way that isn't too invasive to
apply the skip correctly.
Yeah, your explanation was perfect for me. What I failed to
understand was what is expected to be done in this case. I
reconsidered and understood that:

For example, for the following query:

select distinct on (a, b) a, b, c from t where c < 100;

a skip scan returns one tuple for each distinct set of (a, b) with an
arbitrary one of the c values. If the chosen c doesn't match the qual
while some c does, we miss that tuple.

If this is correct, an explanation like the above might help.
+ * If advancing direction is different from index direction, we must
+ * skip right away, but _bt_skip requires a starting point.

It doesn't seem needed to me. Could you elaborate on the reason
for that?

This is needed for e.g. a backward cursor scan without an index condition.
E.g. if we have:

1 1 2 2 3 3
1 2 3 4 5 6

and do

DECLARE c SCROLL CURSOR FOR
SELECT DISTINCT ON (a) a,b FROM ab ORDER BY a, b;

FETCH ALL FROM c;

we should get

1 2 3
1 3 5

When afterwards we do

FETCH BACKWARD ALL FROM c;

we should get

3 2 1
5 2 1

If we use _bt_next the first time without _bt_skip, the first pair would be
3 6 (the first from the end of the tuples, not from the end of the cursor).
Thanks for the explanation. Sorry, I somehow thought that that is
right. You're right.
+ * If advancing direction is different from index direction, we must
+ * skip right away, but _bt_skip requires a starting point.
+ */
+ if (direction * indexonlyscan->indexorderdir < 0 &&
+     !node->ioss_FirstTupleEmitted)

I'm confused by this. "direction" there is the physical scan
direction (fwd/bwd) of index scan, which is already compensated
by indexorderdir. Thus the condition means we do that when
logical ordering (ASC/DESC) is DESC. (Though I'm not sure what
"index direction" exactly means...)

I'm not sure I follow, what do you mean by compensated? In general you're
I meant that the "direction" is already changed to physical order
at the point.
right, as David Rowley mentioned above, indexorderdir is a general scan
direction, and direction is flipped estate->es_direction, which is a cursor
direction. The goal of this condition is catch when those two are different,
and we need to advance and read in different directions.
Mmm. Sorry and thank you for the explanation. I was
stupid. You're right. I perhaps mistook indexorderdir's
meaning. Maybe something like the following will work *for me*:p
| When we are fetching a cursor in backward direction, return the
| tuples that forward fetching should have returned. In other
| words, we return the last scanned tuple in a DISTINCT set. Skip
| to that tuple before returning the first tuple.
# Of course, I need someone to correct this!
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Jul 25, 2019 at 1:21 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
I feel uncomfortable looking into indexinfo there. Couldn't we
use indexskipprefix == -1 to signal !amcanskip from
create_index_path?
Looks like it's not that straightforward to do this only in create_index_path,
since to make this decision we need to have both parts, indexinfo and distinct
keys.
Yeah, your explanation was perfect for me. What I failed to
understand was what is expected to be done in the case. I
reconsidered and understood that:

For example, for the following query:

select distinct on (a, b) a, b, c from t where c < 100;

a skip scan returns one tuple for each distinct set of (a, b) with an
arbitrary one of the c values. If the chosen c doesn't match the qual
while some c does, we miss that tuple.

If this is correct, an explanation like the above might help.
Yes, that's correct; I've added this to the comments.
Maybe something like the following will work *for me*:p
| When we are fetching a cursor in backward direction, return the
| tuples that forward fetching should have returned. In other
| words, we return the last scanned tuple in a DISTINCT set. Skip
| to that tuple before returning the first tuple.
And this too (slightly rewritten :). We will soon post a new version of the
patch with the UniqueKey updates from Jesper.
Hi,
On 8/2/19 8:14 AM, Dmitry Dolgov wrote:
And this too (slightly rewritten:). We will soon post the new version of patch
with updates about UniqueKey from Jesper.
Yes.
We decided to send this now, although there is still feedback from David
that needs to be considered/acted on.
The patches can be reviewed independently, but we will send them as a
set from now on. Development of UniqueKey will be kept separate, though [1].
Note that while UniqueKey can form the foundation of optimizations for
GROUP BY queries, it isn't the focus of this patch series. Contributions
are very welcome, of course.
[1]: https://github.com/jesperpedersen/postgres/tree/uniquekey
Best regards,
Jesper
Attachments:
v23_0001-Unique-key.patch (text/x-patch)
From 35018a382d792d6ceeb8d0e9d16bc14ea2e3f148 Mon Sep 17 00:00:00 2001
From: jesperpedersen <jesper.pedersen@redhat.com>
Date: Fri, 2 Aug 2019 07:52:08 -0400
Subject: [PATCH 1/2] Unique key
Design by David Rowley.
Author: Jesper Pedersen
---
src/backend/nodes/outfuncs.c | 14 +++
src/backend/nodes/print.c | 39 +++++++
src/backend/optimizer/path/Makefile | 2 +-
src/backend/optimizer/path/allpaths.c | 8 ++
src/backend/optimizer/path/costsize.c | 5 +
src/backend/optimizer/path/indxpath.c | 41 +++++++
src/backend/optimizer/path/pathkeys.c | 72 ++++++++++--
src/backend/optimizer/path/uniquekey.c | 147 +++++++++++++++++++++++++
src/backend/optimizer/plan/planner.c | 18 ++-
src/backend/optimizer/util/pathnode.c | 12 ++
src/backend/optimizer/util/tlist.c | 1 -
src/include/nodes/nodes.h | 1 +
src/include/nodes/pathnodes.h | 18 +++
src/include/nodes/print.h | 2 +-
src/include/optimizer/pathnode.h | 1 +
src/include/optimizer/paths.h | 11 ++
16 files changed, 377 insertions(+), 15 deletions(-)
create mode 100644 src/backend/optimizer/path/uniquekey.c
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e6ce8e2110..9a4f3e8e4b 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1720,6 +1720,7 @@ _outPathInfo(StringInfo str, const Path *node)
WRITE_FLOAT_FIELD(startup_cost, "%.2f");
WRITE_FLOAT_FIELD(total_cost, "%.2f");
WRITE_NODE_FIELD(pathkeys);
+ WRITE_NODE_FIELD(uniquekeys);
}
/*
@@ -2201,6 +2202,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
WRITE_NODE_FIELD(eq_classes);
WRITE_BOOL_FIELD(ec_merging_done);
WRITE_NODE_FIELD(canon_pathkeys);
+ WRITE_NODE_FIELD(canon_uniquekeys);
WRITE_NODE_FIELD(left_join_clauses);
WRITE_NODE_FIELD(right_join_clauses);
WRITE_NODE_FIELD(full_join_clauses);
@@ -2210,6 +2212,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
WRITE_NODE_FIELD(placeholder_list);
WRITE_NODE_FIELD(fkey_list);
WRITE_NODE_FIELD(query_pathkeys);
+ WRITE_NODE_FIELD(query_uniquekeys);
WRITE_NODE_FIELD(group_pathkeys);
WRITE_NODE_FIELD(window_pathkeys);
WRITE_NODE_FIELD(distinct_pathkeys);
@@ -2397,6 +2400,14 @@ _outPathKey(StringInfo str, const PathKey *node)
WRITE_BOOL_FIELD(pk_nulls_first);
}
+static void
+_outUniqueKey(StringInfo str, const UniqueKey *node)
+{
+ WRITE_NODE_TYPE("UNIQUEKEY");
+
+ WRITE_NODE_FIELD(eq_clause);
+}
+
static void
_outPathTarget(StringInfo str, const PathTarget *node)
{
@@ -4073,6 +4084,9 @@ outNode(StringInfo str, const void *obj)
case T_PathKey:
_outPathKey(str, obj);
break;
+ case T_UniqueKey:
+ _outUniqueKey(str, obj);
+ break;
case T_PathTarget:
_outPathTarget(str, obj);
break;
diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index 4ecde6b421..62c9d4ef7a 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable)
printf(")\n");
}
+/*
+ * print_uniquekeys -
+ * unique_key an UniqueKey
+ */
+void
+print_uniquekeys(const List *uniquekeys, const List *rtable)
+{
+ ListCell *l;
+
+ printf("(");
+ foreach(l, uniquekeys)
+ {
+ UniqueKey *unique_key = (UniqueKey *) lfirst(l);
+ EquivalenceClass *eclass = (EquivalenceClass *) unique_key->eq_clause;
+ ListCell *k;
+ bool first = true;
+
+ /* chase up */
+ while (eclass->ec_merged)
+ eclass = eclass->ec_merged;
+
+ printf("(");
+ foreach(k, eclass->ec_members)
+ {
+ EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
+
+ if (first)
+ first = false;
+ else
+ printf(", ");
+ print_expr((Node *) mem->em_expr, rtable);
+ }
+ printf(")");
+ if (lnext(uniquekeys, l))
+ printf(", ");
+ }
+ printf(")\n");
+}
+
/*
* print_tl
* print targetlist in a more legible way.
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62132..8249a6b6db 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
- joinpath.o joinrels.o pathkeys.o tidpath.o
+ joinpath.o joinrels.o pathkeys.o tidpath.o uniquekey.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index e9ee32b7f4..6f4c25f7dd 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3957,6 +3957,14 @@ print_path(PlannerInfo *root, Path *path, int indent)
print_pathkeys(path->pathkeys, root->parse->rtable);
}
+ if (path->uniquekeys)
+ {
+ for (i = 0; i < indent; i++)
+ printf("\t");
+ printf(" uniquekeys: ");
+ print_uniquekeys(path->uniquekeys, root->parse->rtable);
+ }
+
if (join)
{
JoinPath *jp = (JoinPath *) path;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 3a9a994733..2565dcf296 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -705,6 +705,11 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
path->path.parallel_aware = true;
}
+ /* Consider cost based on unique key */
+ if (path->path.uniquekeys)
+ {
+ }
+
/*
* Now interpolate based on estimated index order correlation to get total
* disk I/O cost for main table accesses.
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 5f339fdfde..4b90dd378a 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -189,6 +189,7 @@ static Expr *match_clause_to_ordering_op(IndexOptInfo *index,
static bool ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
+static List *get_uniquekeys_for_index(PlannerInfo *root, List *pathkeys);
/*
@@ -874,6 +875,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
List *orderbyclausecols;
List *index_pathkeys;
List *useful_pathkeys;
+ List *useful_uniquekeys = NIL;
bool found_lower_saop_clause;
bool pathkeys_possibly_useful;
bool index_is_ordered;
@@ -1036,11 +1038,15 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
if (index_clauses != NIL || useful_pathkeys != NIL || useful_predicate ||
index_only_scan)
{
+ if (has_useful_uniquekeys(root))
+ useful_uniquekeys = get_uniquekeys_for_index(root, useful_pathkeys);
+
ipath = create_index_path(root, index,
index_clauses,
orderbyclauses,
orderbyclausecols,
useful_pathkeys,
+ useful_uniquekeys,
index_is_ordered ?
ForwardScanDirection :
NoMovementScanDirection,
@@ -1063,6 +1069,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
orderbyclauses,
orderbyclausecols,
useful_pathkeys,
+ useful_uniquekeys,
index_is_ordered ?
ForwardScanDirection :
NoMovementScanDirection,
@@ -1093,11 +1100,15 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
index_pathkeys);
if (useful_pathkeys != NIL)
{
+ if (has_useful_uniquekeys(root))
+ useful_uniquekeys = get_uniquekeys_for_index(root, useful_pathkeys);
+
ipath = create_index_path(root, index,
index_clauses,
NIL,
NIL,
useful_pathkeys,
+ useful_uniquekeys,
BackwardScanDirection,
index_only_scan,
outer_relids,
@@ -1115,6 +1126,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
NIL,
NIL,
useful_pathkeys,
+ useful_uniquekeys,
BackwardScanDirection,
index_only_scan,
outer_relids,
@@ -3369,6 +3381,35 @@ match_clause_to_ordering_op(IndexOptInfo *index,
return clause;
}
+/*
+ * get_uniquekeys_for_index
+ */
+static List *
+get_uniquekeys_for_index(PlannerInfo *root, List *pathkeys)
+{
+ ListCell *lc;
+
+ if (pathkeys)
+ {
+ List *uniquekeys = NIL;
+ foreach(lc, pathkeys)
+ {
+ UniqueKey *unique_key;
+ PathKey *pk = (PathKey *) lfirst(lc);
+ EquivalenceClass *ec = (EquivalenceClass *) pk->pk_eclass;
+
+ unique_key = makeNode(UniqueKey);
+ unique_key->eq_clause = ec;
+
+ uniquekeys = lappend(uniquekeys, unique_key);
+ }
+
+ if (uniquekeys_contained_in(root->canon_uniquekeys, uniquekeys))
+ return uniquekeys;
+ }
+
+ return NIL;
+}
/****************************************************************************
* ---- ROUTINES TO DO PARTIAL INDEX PREDICATE TESTS ----
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 2f4fea241a..0cba366c06 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
#include "utils/lsyscache.h"
+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
RelOptInfo *partrel,
@@ -96,6 +97,30 @@ make_canonical_pathkey(PlannerInfo *root,
return pk;
}
+/*
+ * pathkey_is_unique
+ * Part of pathkey_is_redundant that is responsible for the case where the
+ * new pathkey's equivalence class matches that of an existing member of
+ * the pathkey list.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+ EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+ ListCell *lc;
+
+ /* If the same EC is already in the list, then not unique */
+ foreach(lc, pathkeys)
+ {
+ PathKey *old_pathkey = (PathKey *) lfirst(lc);
+
+ if (new_ec == old_pathkey->pk_eclass)
+ return false;
+ }
+
+ return true;
+}
+
/*
* pathkey_is_redundant
* Is a pathkey redundant with one already in the given list?
@@ -135,22 +160,12 @@ static bool
pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys)
{
EquivalenceClass *new_ec = new_pathkey->pk_eclass;
- ListCell *lc;
/* Check for EC containing a constant --- unconditionally redundant */
if (EC_MUST_BE_REDUNDANT(new_ec))
return true;
- /* If same EC already used in list, then redundant */
- foreach(lc, pathkeys)
- {
- PathKey *old_pathkey = (PathKey *) lfirst(lc);
-
- if (new_ec == old_pathkey->pk_eclass)
- return true;
- }
-
- return false;
+ return !pathkey_is_unique(new_pathkey, pathkeys);
}
/*
@@ -1098,6 +1113,41 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
return pathkeys;
}
+/*
+ * make_pathkeys_for_uniquekeys
+ * Generate a pathkeys list to be used for uniquekey clauses
+ */
+List *
+make_pathkeys_for_uniquekeys(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist)
+{
+ List *pathkeys = NIL;
+ ListCell *l;
+
+ foreach(l, sortclauses)
+ {
+ SortGroupClause *sortcl = (SortGroupClause *) lfirst(l);
+ Expr *sortkey;
+ PathKey *pathkey;
+
+ sortkey = (Expr *) get_sortgroupclause_expr(sortcl, tlist);
+ Assert(OidIsValid(sortcl->sortop));
+ pathkey = make_pathkey_from_sortop(root,
+ sortkey,
+ root->nullable_baserels,
+ sortcl->sortop,
+ sortcl->nulls_first,
+ sortcl->tleSortGroupRef,
+ true);
+
+ if (pathkey_is_unique(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+
+ return pathkeys;
+}
+
/****************************************************************************
* PATHKEYS AND MERGECLAUSES
****************************************************************************/
diff --git a/src/backend/optimizer/path/uniquekey.c b/src/backend/optimizer/path/uniquekey.c
new file mode 100644
index 0000000000..13d4ebb98c
--- /dev/null
+++ b/src/backend/optimizer/path/uniquekey.c
@@ -0,0 +1,147 @@
+/*-------------------------------------------------------------------------
+ *
+ * uniquekey.c
+ * Utilities for matching and building unique keys
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/optimizer/path/uniquekey.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "nodes/pg_list.h"
+
+static UniqueKey *make_canonical_uniquekey(PlannerInfo *root, EquivalenceClass *eclass);
+
+/*
+ * Build a list of unique keys
+ */
+List*
+build_uniquekeys(PlannerInfo *root, List *sortclauses)
+{
+ List *result = NIL;
+ List *sortkeys;
+ ListCell *l;
+
+ sortkeys = make_pathkeys_for_uniquekeys(root,
+ sortclauses,
+ root->processed_tlist);
+
+ /* Create a uniquekey and add it to the list */
+ foreach(l, sortkeys)
+ {
+ PathKey *pathkey = (PathKey *) lfirst(l);
+ EquivalenceClass *ec = pathkey->pk_eclass;
+ UniqueKey *unique_key = make_canonical_uniquekey(root, ec);
+
+ result = lappend(result, unique_key);
+ }
+
+ return result;
+}
+
+/*
+ * uniquekeys_contained_in
+ * Are all keys in keys2 also present in keys1?
+ */
+bool
+uniquekeys_contained_in(List *keys1, List *keys2)
+{
+ ListCell *key1,
+ *key2;
+
+ /*
+ * Fall out quickly if we are passed two identical lists. This mostly
+ * catches the case where both are NIL, but that's common enough to
+ * warrant the test.
+ */
+ if (keys1 == keys2)
+ return true;
+
+ foreach(key2, keys2)
+ {
+ bool found = false;
+ UniqueKey *uniquekey2 = (UniqueKey *) lfirst(key2);
+
+ foreach(key1, keys1)
+ {
+ UniqueKey *uniquekey1 = (UniqueKey *) lfirst(key1);
+
+ if (uniquekey1->eq_clause == uniquekey2->eq_clause)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * has_useful_uniquekeys
+ * Detect whether the planner could have any uniquekeys that are
+ * useful.
+ */
+bool
+has_useful_uniquekeys(PlannerInfo *root)
+{
+ if (root->query_uniquekeys != NIL)
+ return true; /* there are some */
+ return false; /* definitely useless */
+}
+
+/*
+ * make_canonical_uniquekey
+ * Given the parameters for a UniqueKey, find any pre-existing matching
+ * uniquekey in the query's list of "canonical" uniquekeys. Make a new
+ * entry if there's not one already.
+ *
+ * Note that this function must not be used until after we have completed
+ * merging EquivalenceClasses. (We don't try to enforce that here; instead,
+ * equivclass.c will complain if a merge occurs after root->canon_uniquekeys
+ * has become nonempty.)
+ */
+static UniqueKey *
+make_canonical_uniquekey(PlannerInfo *root,
+ EquivalenceClass *eclass)
+{
+ UniqueKey *uk;
+ ListCell *lc;
+ MemoryContext oldcontext;
+
+ /* The passed eclass might be non-canonical, so chase up to the top */
+ while (eclass->ec_merged)
+ eclass = eclass->ec_merged;
+
+ foreach(lc, root->canon_uniquekeys)
+ {
+ uk = (UniqueKey *) lfirst(lc);
+ if (eclass == uk->eq_clause)
+ return uk;
+ }
+
+ /*
+ * Be sure canonical uniquekeys are allocated in the main planning context.
+ * Not an issue in normal planning, but it is for GEQO.
+ */
+ oldcontext = MemoryContextSwitchTo(root->planner_cxt);
+
+ uk = makeNode(UniqueKey);
+ uk->eq_clause = eclass;
+
+ root->canon_uniquekeys = lappend(root->canon_uniquekeys, uk);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return uk;
+}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 8f51f59f8a..5ee9ee6595 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3656,15 +3656,31 @@ standard_qp_callback(PlannerInfo *root, void *extra)
* much easier, since we know that the parser ensured that one is a
* superset of the other.
*/
+ root->canon_uniquekeys = NIL;
+ root->query_uniquekeys = NIL;
+
if (root->group_pathkeys)
+ {
root->query_pathkeys = root->group_pathkeys;
+
+ if (!root->parse->hasAggs)
+ root->query_uniquekeys = build_uniquekeys(root, qp_extra->groupClause);
+ }
else if (root->window_pathkeys)
root->query_pathkeys = root->window_pathkeys;
else if (list_length(root->distinct_pathkeys) >
list_length(root->sort_pathkeys))
+ {
root->query_pathkeys = root->distinct_pathkeys;
+ root->query_uniquekeys = build_uniquekeys(root, parse->distinctClause);
+ }
else if (root->sort_pathkeys)
+ {
root->query_pathkeys = root->sort_pathkeys;
+
+ if (root->distinct_pathkeys)
+ root->query_uniquekeys = build_uniquekeys(root, parse->distinctClause);
+ }
else
root->query_pathkeys = NIL;
}
@@ -6222,7 +6238,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
/* Estimate the cost of index scan */
indexScanPath = create_index_path(root, indexInfo,
- NIL, NIL, NIL, NIL,
+ NIL, NIL, NIL, NIL, NIL,
ForwardScanDirection, false,
NULL, 1.0, false);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 0ac73984d2..ac0b937895 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -941,6 +941,7 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = parallel_workers;
pathnode->pathkeys = NIL; /* seqscan has unordered result */
+ pathnode->uniquekeys = NIL;
cost_seqscan(pathnode, root, rel, pathnode->param_info);
@@ -965,6 +966,7 @@ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* samplescan has unordered result */
+ pathnode->uniquekeys = NIL;
cost_samplescan(pathnode, root, rel, pathnode->param_info);
@@ -1001,6 +1003,7 @@ create_index_path(PlannerInfo *root,
List *indexorderbys,
List *indexorderbycols,
List *pathkeys,
+ List *uniquekeys,
ScanDirection indexscandir,
bool indexonly,
Relids required_outer,
@@ -1019,6 +1022,7 @@ create_index_path(PlannerInfo *root,
pathnode->path.parallel_safe = rel->consider_parallel;
pathnode->path.parallel_workers = 0;
pathnode->path.pathkeys = pathkeys;
+ pathnode->path.uniquekeys = uniquekeys;
pathnode->indexinfo = index;
pathnode->indexclauses = indexclauses;
@@ -1062,6 +1066,7 @@ create_bitmap_heap_path(PlannerInfo *root,
pathnode->path.parallel_safe = rel->consider_parallel;
pathnode->path.parallel_workers = parallel_degree;
pathnode->path.pathkeys = NIL; /* always unordered */
+ pathnode->path.uniquekeys = NIL;
pathnode->bitmapqual = bitmapqual;
@@ -1923,6 +1928,7 @@ create_functionscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = pathkeys;
+ pathnode->uniquekeys = NIL;
cost_functionscan(pathnode, root, rel, pathnode->param_info);
@@ -1949,6 +1955,7 @@ create_tablefuncscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->uniquekeys = NIL;
cost_tablefuncscan(pathnode, root, rel, pathnode->param_info);
@@ -1975,6 +1982,7 @@ create_valuesscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->uniquekeys = NIL;
cost_valuesscan(pathnode, root, rel, pathnode->param_info);
@@ -2000,6 +2008,7 @@ create_ctescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* XXX for now, result is always unordered */
+ pathnode->uniquekeys = NIL;
cost_ctescan(pathnode, root, rel, pathnode->param_info);
@@ -2026,6 +2035,7 @@ create_namedtuplestorescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->uniquekeys = NIL;
cost_namedtuplestorescan(pathnode, root, rel, pathnode->param_info);
@@ -2052,6 +2062,7 @@ create_resultscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->uniquekeys = NIL;
cost_resultscan(pathnode, root, rel, pathnode->param_info);
@@ -2078,6 +2089,7 @@ create_worktablescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->uniquekeys = NIL;
/* Cost is the same as for a regular CTE scan */
cost_ctescan(pathnode, root, rel, pathnode->param_info);
diff --git a/src/backend/optimizer/util/tlist.c b/src/backend/optimizer/util/tlist.c
index 7ccb10e4e1..618032e82c 100644
--- a/src/backend/optimizer/util/tlist.c
+++ b/src/backend/optimizer/util/tlist.c
@@ -427,7 +427,6 @@ get_sortgrouplist_exprs(List *sgClauses, List *targetList)
return result;
}
-
/*****************************************************************************
* Functions to extract data from a list of SortGroupClauses
*
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 4e2fb39105..a9b67c64f8 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -261,6 +261,7 @@ typedef enum NodeTag
T_EquivalenceMember,
T_PathKey,
T_PathTarget,
+ T_UniqueKey,
T_RestrictInfo,
T_IndexClause,
T_PlaceHolderVar,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index e3c579ee44..c1d6f33fc0 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -269,6 +269,8 @@ struct PlannerInfo
List *canon_pathkeys; /* list of "canonical" PathKeys */
+ List *canon_uniquekeys; /* list of "canonical" UniqueKeys */
+
List *left_join_clauses; /* list of RestrictInfos for mergejoinable
* outer join clauses w/nonnullable var on
* left */
@@ -297,6 +299,8 @@ struct PlannerInfo
List *query_pathkeys; /* desired pathkeys for query_planner() */
+ List *query_uniquekeys; /* unique keys desired for query_planner() */
+
List *group_pathkeys; /* groupClause pathkeys, if any */
List *window_pathkeys; /* pathkeys of bottom window, if any */
List *distinct_pathkeys; /* distinctClause pathkeys, if any */
@@ -1077,6 +1081,15 @@ typedef struct ParamPathInfo
List *ppi_clauses; /* join clauses available from outer rels */
} ParamPathInfo;
+/*
+ * UniqueKey
+ */
+typedef struct UniqueKey
+{
+ NodeTag type;
+
+ EquivalenceClass *eq_clause; /* equivalence class */
+} UniqueKey;
/*
* Type "Path" is used as-is for sequential-scan paths, as well as some other
@@ -1106,6 +1119,9 @@ typedef struct ParamPathInfo
*
* "pathkeys" is a List of PathKey nodes (see above), describing the sort
* ordering of the path's output rows.
+ *
+ * "uniquekeys", if not NIL, is a list of UniqueKey nodes (see above),
+ * describing the keys on which the path's output is known to be distinct.
*/
typedef struct Path
{
@@ -1129,6 +1145,8 @@ typedef struct Path
List *pathkeys; /* sort ordering of path's output */
/* pathkeys is a List of PathKey nodes; see above */
+
+ List *uniquekeys; /* the unique keys, or NIL if none */
} Path;
/* Macro for extracting a path's parameterization relids; beware double eval */
diff --git a/src/include/nodes/print.h b/src/include/nodes/print.h
index cbff56a724..31e8f0686e 100644
--- a/src/include/nodes/print.h
+++ b/src/include/nodes/print.h
@@ -16,7 +16,6 @@
#include "executor/tuptable.h"
-
#define nodeDisplay(x) pprint(x)
extern void print(const void *obj);
@@ -28,6 +27,7 @@ extern char *pretty_format_node_dump(const char *dump);
extern void print_rt(const List *rtable);
extern void print_expr(const Node *expr, const List *rtable);
extern void print_pathkeys(const List *pathkeys, const List *rtable);
+extern void print_uniquekeys(const List *uniquekeys, const List *rtable);
extern void print_tl(const List *tlist, const List *rtable);
extern void print_slot(TupleTableSlot *slot);
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 182ffeef4b..374cac157b 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -44,6 +44,7 @@ extern IndexPath *create_index_path(PlannerInfo *root,
List *indexorderbys,
List *indexorderbycols,
List *pathkeys,
+ List *uniquekeys,
ScanDirection indexscandir,
bool indexonly,
Relids required_outer,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137d1d..c7976d4a90 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -209,6 +209,9 @@ extern List *build_join_pathkeys(PlannerInfo *root,
extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
List *sortclauses,
List *tlist);
+extern List *make_pathkeys_for_uniquekeys(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist);
extern void initialize_mergeclause_eclasses(PlannerInfo *root,
RestrictInfo *restrictinfo);
extern void update_mergeclause_eclasses(PlannerInfo *root,
@@ -235,4 +238,12 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *live_childrels);
+/*
+ * uniquekey.c
+ * Utilities for matching and building unique keys
+ */
+extern List *build_uniquekeys(PlannerInfo *root, List *sortclauses);
+extern bool uniquekeys_contained_in(List *keys1, List *keys2);
+extern bool has_useful_uniquekeys(PlannerInfo *root);
+
#endif /* PATHS_H */
--
2.21.0
Attachment: v23_0002-Index-skip-scan.patch (text/x-patch)
From 79df11a9f74d781a4eaded67772ce0ef1890df80 Mon Sep 17 00:00:00 2001
From: jesperpedersen <jesper.pedersen@redhat.com>
Date: Fri, 2 Aug 2019 08:10:05 -0400
Subject: [PATCH 2/2] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan and IndexScan. To make it suitable both for
situations with a small number of distinct values and for those with a
significant number of distinct values, the following approach is taken:
instead of searching from the root for every value, we search first on
the current page, and only if the value is not found there do we
continue the search from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
Author: Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Floris Van Nee, Kyotaro Horiguchi
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 15 +
doc/src/sgml/indexam.sgml | 63 ++
doc/src/sgml/indices.sgml | 24 +
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 18 +
src/backend/access/nbtree/nbtree.c | 13 +
src/backend/access/nbtree/nbtsearch.c | 652 +++++++++++++++++-
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 29 +
src/backend/executor/nodeIndexonlyscan.c | 46 +-
src/backend/executor/nodeIndexscan.c | 43 +-
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 2 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/createplan.c | 20 +-
src/backend/optimizer/plan/planner.c | 76 ++
src/backend/optimizer/util/pathnode.c | 40 ++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 8 +
src/include/access/genam.h | 2 +
src/include/access/nbtree.h | 7 +
src/include/nodes/execnodes.h | 6 +
src/include/nodes/pathnodes.h | 5 +
src/include/nodes/plannodes.h | 2 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 434 ++++++++++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 163 +++++
38 files changed, 1692 insertions(+), 10 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index cc1670934f..ab9f0a7177 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c91e3e1550..e202589e98 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4400,6 +4400,21 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). The default is
+ <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index dd54c68802..73b1b4fcf7 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -144,6 +144,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -687,6 +688,68 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan,
+ ScanDirection direction,
+ ScanDirection indexdir,
+ bool scanstart,
+ int prefix);
+</programlisting>
+ Skip past all tuples where the first <parameter>prefix</parameter> columns have the same value as
+ the last tuple returned in the current scan. The arguments are:
+
+ <variablelist>
+ <varlistentry>
+ <term><parameter>scan</parameter></term>
+ <listitem>
+ <para>
+ Index scan information
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>direction</parameter></term>
+ <listitem>
+ <para>
+ The direction in which data is advancing.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>indexdir</parameter></term>
+ <listitem>
+ <para>
+ The index direction in which data must be read.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>scanstart</parameter></term>
+ <listitem>
+ <para>
+ Whether or not this is the start of the scan.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>prefix</parameter></term>
+ <listitem>
+ <para>
+ Distinct prefix size.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 95c0a1926c..567141046f 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1235,6 +1235,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+    When an index scan is used to retrieve the distinct values of a column,
+    it can be inefficient, since it has to scan all the equal values of a
+    key.  In such cases the planner will consider applying the index skip
+    scan approach, which is based on the idea of a
+    <firstterm>Loose index scan</firstterm>.  Rather than scanning all equal
+    values of a key, as soon as a new value is found, it will search for a
+    larger value on the same index page, and if there is none, restart the
+    descent from the root looking for a larger value.  This is much faster
+    when the index has many equal keys.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ae7b729edd..233ea9e5ec 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -109,6 +109,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cf9699ad18..9817f34c34 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index e9ca4b8252..55a8b16b8a 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 5cc30dac42..019e330cff 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -82,6 +82,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 28edd4aca7..ae7a882571 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -730,6 +731,23 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool scanstart, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction,
+ indexdir, scanstart, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..46471598d1 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -131,6 +131,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -380,6 +381,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -447,6 +450,16 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool start, int prefix)
+{
+ return _bt_skip(scan, direction, indexdir, start, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 19735bf733..c7c7b77b8c 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -28,6 +28,8 @@ static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
+static bool _bt_read_closest(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
@@ -37,7 +39,10 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
+static inline void _bt_update_skip_scankeys(IndexScanDesc scan,
+ Relation indexRel);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir);
/*
* _bt_drop_lock_and_maybe_pin()
@@ -1380,6 +1385,315 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple.
+ *
+ * The current position is set so that a subsequent call to _bt_next will
+ * fetch the first tuple that differs in the leading 'prefix' keys.
+ *
+ * There are four different kinds of skipping (depending on dir and
+ * indexdir) that are important to distinguish, especially in the presence
+ * of an index condition:
+ *
+ * * Advancing forward and reading forward
+ * simple scan
+ *
+ * * Advancing forward and reading backward
+ * scan inside a cursor fetching backward, when skipping is necessary
+ * right from the start
+ *
+ * * Advancing backward and reading forward
+ * scan with order by desc inside a cursor fetching forward, when
+ * skipping is necessary right from the start
+ *
+ * * Advancing backward and reading backward
+ * simple scan with order by desc
+ *
+ * This function in conjunction with _bt_read_closest handles them all.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool scanstart, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Relation indexRel = scan->indexRelation;
+ OffsetNumber startOffset = ItemPointerGetOffsetNumber(&scan->xs_itup->t_tid);
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ so->skipScanKey->keysz = prefix;
+ so->skipScanKey->scantid = NULL;
+ }
+ else
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos) &&
+ _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {
+ bool keyFound = false;
+
+ LockBuffer(so->currPos.buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, so->currPos.buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(so->currPos.buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_read_closest(scan, dir, indexdir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found the scan key within the current page, so let's scan
+ * from the root. Use _bt_search and _bt_binsrch to get the buffer and
+ * offset number.
+ */
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ /*
+ * Simplest case, advance forward and read also forward. At this moment we
+ * are at the next distinct key at the beginning of the series. Go back one
+ * step and let _bt_read_closest figure out about index condition.
+ */
+ if (ScanDirectionIsForward(dir) && ScanDirectionIsForward(indexdir))
+ offnum = OffsetNumberPrev(offnum);
+
+ /*
+ * Advance backward but read forward. At this moment we are at the next
+ * distinct key at the beginning of the series. If the scan has just
+ * started, we can read forward without doing anything else. Otherwise
+ * find the previous distinct key and the beginning of its series, and
+ * read forward from there. To do so, go back one step, perform a binary
+ * search to find the first item in the series, and let _bt_read_closest
+ * do everything else.
+ */
+ else if (ScanDirectionIsBackward(dir) && ScanDirectionIsForward(indexdir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ if (!scanstart)
+ {
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ /* One step back to find a previous value */
+ _bt_read_closest(scan, dir, dir, offnum);
+
+ if (_bt_next(scan, dir))
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ /*
+ * And now find the last item from the sequence for the current
+ * value, with the intention of doing OffsetNumberNext. As a
+ * result we end up on the first element of the sequence.
+ */
+ if (_bt_scankey_within_page(scan, so->skipScanKey,
+ so->currPos.buf, dir))
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ else
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ }
+ }
+ else
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ }
+
+ /*
+ * Advance forward but read backward. At this moment we are at the next
+ * distinct key at the beginning of the series. If the scan has just
+ * started, we can go one step back and read backward without doing
+ * anything else. Otherwise find the next distinct key and the beginning
+ * of its series, go one step back and read backward from there.
+ *
+ * An interesting situation can happen if one of the distinct keys does
+ * not pass a corresponding index condition at all. In this case reading
+ * backward can lead to the previous distinct key being found, creating a
+ * loop. To avoid that, check the value to be returned, and jump one more
+ * time if it is the same as at the beginning.
+ */
+ else if (ScanDirectionIsForward(dir) && ScanDirectionIsBackward(indexdir))
+ {
+ if (scanstart)
+ offnum = OffsetNumberPrev(offnum);
+ else
+ {
+ OffsetNumber nextOffset = startOffset;
+
+ while(nextOffset == startOffset)
+ {
+ /*
+ * Find the next index tuple to update the scan key with. It
+ * could be at the end of the page, so check against the max
+ * offset.
+ */
+ OffsetNumber curOffnum = offnum;
+ Page page = BufferGetPage(so->currPos.buf);
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ ItemId itemid = PageGetItemId(page, Min(offnum, maxoff));
+
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ scan->xs_itup = (IndexTuple) PageGetItem(page, itemid);
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /*
+ * If jumping to the next key returned the same offset, we are
+ * at the end and need to return.
+ */
+ if (offnum == curOffnum)
+ {
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ BTScanPosUnpinIfPinned(so->currPos);
+ BTScanPosInvalidate(so->currPos);
+
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+
+ /* Check if _bt_read_closest returns the already-found item */
+ if (_bt_read_closest(scan, dir, indexdir, offnum))
+ {
+ IndexTuple itup;
+
+ currItem = &so->currPos.items[so->currPos.lastItem];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
+ }
+ else
+ elog(ERROR, "could not read closest index tuples: %d", offnum);
+
+ /*
+ * If nextOffset is the same as before, it means we are in a
+ * loop; return offnum to the original position and jump
+ * further.
+ */
+ if (nextOffset == startOffset)
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+ }
+
+ /* Now read the data */
+ if (!_bt_read_closest(scan, dir, indexdir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
@@ -1596,6 +1910,293 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
return (so->currPos.firstItem <= so->currPos.lastItem);
}
+/*
+ * _bt_read_closest() -- Load data from the closest two items, the previous
+ * and the current one on the current index page, into
+ * so->currPos. The previous item may not pass the index
+ * condition, but it is needed for skip scan.
+ *
+ * Similar to _bt_readpage, except that it reads only the current and the
+ * previous item. So far it is only used by _bt_skip.
+ *
+ * Returns true if the required two matching items were found on the page,
+ * false otherwise.
+ */
+static bool
+_bt_read_closest(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, OffsetNumber offnum)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ IndexTuple prevItup = NULL;
+ int itemIndex;
+ bool continuescan;
+ int indnatts;
+
+ /*
+ * We must have the buffer pinned and locked, but the usual macro can't be
+ * used here; this function is what makes it good for currPos.
+ */
+ Assert(BufferIsValid(so->currPos.buf));
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ /* allow next page be processed by parallel worker */
+ if (scan->parallel_scan)
+ {
+ if (ScanDirectionIsForward(dir))
+ _bt_parallel_release(scan, opaque->btpo_next);
+ else
+ _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ }
+
+ continuescan = true; /* default assumption */
+ indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * We note the buffer's block number so that we can release the pin later.
+ * This allows us to re-read the buffer if it is needed again for hinting.
+ */
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * safe to apply LP_DEAD hints to the page later. This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+
+ /*
+ * we must save the page's right-link while scanning it; this tells us
+ * where to step right to after we're done with these items. There is no
+ * corresponding need for the left-link, since splits always go right.
+ */
+ so->currPos.nextPage = opaque->btpo_next;
+
+ /* initialize tuple workspace to empty */
+ so->currPos.nextTupleOffset = 0;
+
+ /*
+ * Now that the current page has been made consistent, the macro should be
+ * good.
+ */
+ Assert(BTScanPosIsPinned(so->currPos));
+
+ if (ScanDirectionIsForward(indexdir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ offnum = Max(offnum, minoff);
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ offnum = OffsetNumberNext(offnum);
+ continue;
+ }
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
+ {
+ /* tuple passes all scan key conditions, so remember it */
+ if (ScanDirectionIsBackward(dir))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else if (prevItup != NULL)
+ {
+ /*
+ * Save the current item and the previous, even if the
+ * latter does not pass scan key conditions
+ */
+ ItemPointerData tid = prevItup->t_tid;
+ OffsetNumber prevOffnum = ItemPointerGetOffsetNumber(&tid);
+
+ _bt_saveitem(so, itemIndex, prevOffnum, prevItup);
+ itemIndex++;
+
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+
+ if (itemIndex == 2)
+ {
+ Assert(itemIndex <= MaxIndexTuplesPerPage);
+ so->currPos.firstItem = 0;
+ /*
+ * The actual itemIndex depends on the direction in which we
+ * advance, when it differs from indexdir.
+ */
+ so->currPos.itemIndex = ScanDirectionIsForward(dir) ? 0 : 1;
+ so->currPos.lastItem = 1;
+
+ /*
+ * All of the closest items were found, so we can report
+ * success
+ */
+ return true;
+ }
+ }
+ /* When !continuescan, there can't be any more matches, so stop */
+ if (!continuescan)
+ break;
+
+ prevItup = itup;
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /*
+ * We don't need to visit page to the right when the high key
+ * indicates that no more matches will be found there.
+ *
+ * Checking the high key like this works out more often than you might
+ * think. Leaf page splits pick a split point between the two most
+ * dissimilar tuples (this is weighed against the need to evenly share
+ * free space). Leaf pages with high key attribute values that can
+ * only appear on non-pivot tuples on the right sibling page are
+ * common.
+ */
+ if (continuescan && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
+ int truncatt;
+
+ truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
+ _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+ }
+
+ if (!continuescan)
+ so->currPos.moreRight = false;
+
+ Assert(itemIndex <= MaxIndexTuplesPerPage);
+ so->currPos.firstItem = 0;
+ so->currPos.lastItem = itemIndex - 1;
+ so->currPos.itemIndex = 0;
+ }
+ else
+ {
+ /* load items[] in descending order */
+ itemIndex = MaxIndexTuplesPerPage;
+
+ offnum = Min(offnum, maxoff);
+
+ while (offnum >= minoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+ bool tuple_alive;
+ bool passes_quals;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual. Most of the
+ * time, it's a win to not bother examining the tuple's index
+ * keys, but just skip to the next tuple (previous, actually,
+ * since we're scanning backwards). However, if this is the first
+ * tuple on the page, we do check the index keys, to prevent
+ * uselessly advancing to the page to the left. This is similar
+ * to the high key optimization used by forward scans.
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ Assert(offnum >= P_FIRSTDATAKEY(opaque));
+ if (offnum > P_FIRSTDATAKEY(opaque))
+ {
+ offnum = OffsetNumberPrev(offnum);
+ continue;
+ }
+
+ tuple_alive = false;
+ }
+ else
+ tuple_alive = true;
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
+ &continuescan);
+ if (passes_quals && tuple_alive)
+ {
+ /* tuple passes all scan key conditions, so remember it */
+ if (ScanDirectionIsForward(dir))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else if (prevItup != NULL)
+ {
+ /*
+ * Save the current item and the previous, even if the
+ * latter does not pass scan key conditions
+ */
+ ItemPointerData tid = prevItup->t_tid;
+ OffsetNumber prevOffnum = ItemPointerGetOffsetNumber(&tid);
+
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, prevOffnum, prevItup);
+
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+
+ if (MaxIndexTuplesPerPage - itemIndex == 2)
+ {
+ Assert(itemIndex <= MaxIndexTuplesPerPage);
+ so->currPos.firstItem = MaxIndexTuplesPerPage - 2;
+ /*
+ * The actual itemIndex depends on the direction in which we
+ * advance, when it differs from indexdir.
+ */
+ so->currPos.itemIndex = MaxIndexTuplesPerPage -
+ (ScanDirectionIsForward(dir) ? 2 : 1);
+ so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+
+ /*
+ * All of the closest items were found, so we can report
+ * success
+ */
+ return true;
+ }
+ }
+ if (!continuescan)
+ {
+ /* there can't be any more matches, so stop */
+ so->currPos.moreLeft = false;
+ break;
+ }
+
+ prevItup = itup;
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ Assert(itemIndex >= 0);
+ so->currPos.firstItem = itemIndex;
+ so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ }
+
+ /* Not all of the closest items were found */
+ return false;
+}
+
/* Save an index item into so->currPos.items[itemIndex] */
static void
_bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -2251,3 +2852,52 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
so->numKilled = 0; /* just paranoia */
so->markItemIndex = -1; /* ditto */
}
+
+/*
+ * _bt_update_skip_scankeys() -- set up new values for the existing scankeys
+ * based on the current index tuple
+ */
+static inline void
+_bt_update_skip_scankeys(IndexScanDesc scan, Relation indexRel)
+{
+ TupleDesc itupdesc;
+ int indnkeyatts, i;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey scankeys = so->skipScanKey->scankeys;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+}
+
+/*
+ * _bt_scankey_within_page() -- check if the provided scankey could be found
+ * within the page held in the given buffer.
+ */
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir)
+{
+ OffsetNumber low, high, compare_offset;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ return _bt_compare(scan->indexRelation,
+ key, page, compare_offset) > compare_value;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 45472db147..dc151ecf09 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 62fb3434a3..ad500de12b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -130,6 +130,7 @@ static void ExplainDummyGroup(const char *objtype, const char *labelname,
static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
static void ExplainJSONLineEnding(ExplainState *es);
static void ExplainYAMLLineStarting(ExplainState *es);
+static void ExplainIndexSkipScanKeys(ExplainState *es, int skipPrefixSize);
static void escape_yaml(StringInfo buf, const char *str);
@@ -1041,6 +1042,22 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
return planstate_tree_walker(planstate, ExplainPreScanNode, rels_used);
}
+/*
+ * ExplainIndexSkipScanKeys -
+ * Append information about index skip scan to es->str.
+ *
+ * Can be used to print the skip prefix size.
+ */
+static void
+ExplainIndexSkipScanKeys(ExplainState *es, int skipPrefixSize)
+{
+ if (skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL, skipPrefixSize, es);
+ }
+}
+
/*
* ExplainNode -
* Appends a description of a plan tree to es->str
@@ -1363,6 +1380,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexScan *indexscan = (IndexScan *) plan;
+ ExplainIndexSkipScanKeys(es, indexscan->indexskipprefixsize);
+
ExplainIndexScanDetails(indexscan->indexid,
indexscan->indexorderdir,
es);
@@ -1373,6 +1392,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ ExplainIndexSkipScanKeys(es, indexonlyscan->indexskipprefixsize);
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1582,6 +1603,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
switch (nodeTag(plan))
{
case T_IndexScan:
+ if (((IndexScan *) plan)->indexskipprefixsize > 0)
+ {
+ ExplainPropertyBool("Skip scan mode", true, es);
+ }
show_scan_qual(((IndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
if (((IndexScan *) plan)->indexqualorig)
@@ -1595,6 +1620,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->indexskipprefixsize > 0)
+ {
+ ExplainPropertyBool("Skip scan mode", true, es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 652a9afc75..21f169f5ea 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -65,6 +65,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
IndexScanDesc scandesc;
TupleTableSlot *slot;
ItemPointer tid;
+ IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) node->ss.ps.plan;
/*
* extract necessary information from index scan node
@@ -72,7 +73,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexonlyscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -115,6 +116,45 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_SkipPrefixSize > 0)
+ {
+ bool startscan = false;
+
+ /*
+ * When fetching a cursor in the direction opposite to a general scan
+ * direction, the result must be what normal fetching should have
+ * returned, but in reversed order. In other words, return the last or
+ * first scanned tuple in a DISTINCT set, depending on a cursor
+ * direction. Skip to that tuple before returning the first tuple.
+ */
+ if (direction * indexonlyscan->indexorderdir < 0 &&
+ !node->ioss_FirstTupleEmitted)
+ {
+ if (index_getnext_tid(scandesc, direction))
+ {
+ node->ioss_FirstTupleEmitted = true;
+ startscan = true;
+ }
+ }
+
+ if (node->ioss_FirstTupleEmitted &&
+ !index_skip(scandesc, direction, indexonlyscan->indexorderdir,
+ startscan, node->ioss_SkipPrefixSize))
+ {
+ /*
+ * Reached end of index. At this point currPos is invalidated, and
+ * we need to reset ioss_FirstTupleEmitted, since otherwise after
+ * going backwards, reaching the end of the index, and going
+ * forward again we would apply skip again, which would be
+ * incorrect and lead to an extra skipped item.
+ */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -250,6 +290,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -500,6 +542,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index ac7aa81f67..6e649930c2 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -85,6 +85,7 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ IndexScan *indexscan = (IndexScan *) node->ss.ps.plan;
/*
* extract necessary information from index scan node
@@ -92,7 +93,7 @@ IndexNext(IndexScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -116,6 +117,7 @@ IndexNext(IndexScanState *node)
node->iss_NumOrderByKeys);
node->iss_ScanDesc = scandesc;
+ node->iss_ScanDesc->xs_want_itup = true;
/*
* If no run-time keys to calculate or they are ready, go ahead and
@@ -127,6 +129,42 @@ IndexNext(IndexScanState *node)
node->iss_OrderByKeys, node->iss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_SkipPrefixSize > 0)
+ {
+ bool startscan = false;
+
+ /*
+ * If advancing direction is different from index direction, we must
+ * skip right away, but _bt_skip requires a starting point.
+ */
+ if (direction * indexscan->indexorderdir < 0 &&
+ !node->ioss_FirstTupleEmitted)
+ {
+ if (index_getnext_slot(scandesc, direction, slot))
+ {
+ node->ioss_FirstTupleEmitted = true;
+ startscan = true;
+ }
+ }
+
+ if (node->ioss_FirstTupleEmitted &&
+ !index_skip(scandesc, direction, indexscan->indexorderdir,
+ startscan, node->ioss_SkipPrefixSize))
+ {
+ /*
+ * Reached end of index. At this point currPos is invalidated, and
+ * we need to reset ioss_FirstTupleEmitted, since otherwise after
+ * going backwards, reaching the end of the index, and going
+ * forward again we would apply skip again, which would be
+ * incorrect and lead to an extra skipped item.
+ */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* ok, now that we have what we need, fetch the next tuple.
*/
@@ -149,6 +187,7 @@ IndexNext(IndexScanState *node)
}
}
+ node->ioss_FirstTupleEmitted = true;
return slot;
}
@@ -906,6 +945,8 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexScan;
+ indexstate->ioss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index a2617c7cfd..20495c9e52 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -490,6 +490,7 @@ _copyIndexScan(const IndexScan *from)
COPY_NODE_FIELD(indexorderbyorig);
COPY_NODE_FIELD(indexorderbyops);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
@@ -515,6 +516,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 9a4f3e8e4b..b09f8982ff 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -559,6 +559,7 @@ _outIndexScan(StringInfo str, const IndexScan *node)
WRITE_NODE_FIELD(indexorderbyorig);
WRITE_NODE_FIELD(indexorderbyops);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
@@ -573,6 +574,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 764e3bb90c..0fc3c5ea68 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1787,6 +1787,7 @@ _readIndexScan(void)
READ_NODE_FIELD(indexorderbyorig);
READ_NODE_FIELD(indexorderbyops);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
@@ -1806,6 +1807,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 2565dcf296..3d33160cde 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index f2325694c5..1805bbc07d 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -175,12 +175,14 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
Oid indexid, List *indexqual, List *indexqualorig,
List *indexorderby, List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2905,7 +2907,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -2916,7 +2919,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexorderbys,
indexorderbys,
indexorderbyops,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
copy_generic_path_info(&scan_plan->plan, &best_path->path);
@@ -5179,7 +5183,8 @@ make_indexscan(List *qptlist,
List *indexorderby,
List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexScan *node = makeNode(IndexScan);
Plan *plan = &node->scan.plan;
@@ -5196,6 +5201,7 @@ make_indexscan(List *qptlist,
node->indexorderbyorig = indexorderbyorig;
node->indexorderbyops = indexorderbyops;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
@@ -5208,7 +5214,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5223,6 +5230,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 5ee9ee6595..53930af364 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4834,6 +4834,82 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Consider index skip scan as well */
+ if (enable_indexskipscan &&
+ IsA(path, IndexPath) &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys != NIL)
+ {
+ ListCell *lc;
+ IndexOptInfo *index = NULL;
+ bool different_columns_order = false,
+ not_empty_qual = false;
+ int i = 0;
+ int distinctPrefixKeys;
+
+ Assert(path->pathtype == T_IndexOnlyScan ||
+ path->pathtype == T_IndexScan);
+
+ index = ((IndexPath *) path)->indexinfo;
+ distinctPrefixKeys = list_length(root->query_uniquekeys);
+
+ /*
+ * Normally distinctPrefixKeys is just the number of
+ * distinct keys. But suppose we have a distinct key a,
+ * while the index contains b, a in exactly this order.
+ * In such a situation we need to use the position of a
+ * in the index as distinctPrefixKeys; otherwise skipping
+ * will happen only by the first column.
+ */
+ foreach(lc, root->query_uniquekeys)
+ {
+ UniqueKey *uniquekey = (UniqueKey *) lfirst(lc);
+ EquivalenceMember *em =
+ lfirst_node(EquivalenceMember,
+ list_head(uniquekey->eq_clause->ec_members));
+ Var *var = (Var *) em->em_expr;
+
+ Assert(i < index->ncolumns);
+
+ for (i = 0; i < index->ncolumns; i++)
+ {
+ if (index->indexkeys[i] == var->varattno)
+ {
+ distinctPrefixKeys = Max(i + 1, distinctPrefixKeys);
+ break;
+ }
+ }
+ }
+
+ /*
+ * XXX: In case of an index scan, quals evaluation happens
+ * after ExecScanFetch, which means skip results could be
+ * filtered out. Consider the following query:
+ *
+ * select distinct (a, b) a, b, c from t where c < 100;
+ *
+ * Skip scan returns one tuple for each distinct set of (a, b)
+ * with an arbitrary c, so if the chosen c does not match the
+ * qual while some other c does, we miss that tuple.
+ */
+ if (path->pathtype == T_IndexScan &&
+ parse->jointree != NULL &&
+ parse->jointree->quals != NULL &&
+ list_length((List*) parse->jointree->quals) != 0)
+ not_empty_qual = true;
+
+ if (!different_columns_order && !not_empty_qual)
+ {
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ distinctPrefixKeys,
+ numDistinctRows));
+ }
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index ac0b937895..6bde8a8647 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2916,6 +2916,46 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *basepath,
+ int distinctPrefixKeys,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(distinctPrefixKeys > 0);
+ /*Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));*/
+ pathnode->indexskipprefix = distinctPrefixKeys;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = basepath->startup_cost;
+ pathnode->path.total_cost = basepath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 98e99481c6..7a6db36a57 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -269,6 +269,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fc463601ff..4f061f2e14 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -911,6 +911,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cfad86c02a..21d1effe61 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6e3db06eed..f84791e358 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -130,6 +130,13 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir,
+ ScanDirection indexdir,
+ bool start,
+ int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -225,6 +232,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 8c053be2ca..e5ec5b07c8 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -173,6 +173,8 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool start, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 83e0e6c28e..c775554f0b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -663,6 +663,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTScanInsert skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -777,6 +780,8 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool start, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -801,6 +806,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool start, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4ec78491f6..9fbb822653 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1377,6 +1377,8 @@ typedef struct IndexScanState
ExprContext *iss_RuntimeContext;
Relation iss_RelationDesc;
struct IndexScanDescData *iss_ScanDesc;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
/* These are needed for re-checking ORDER BY expr ordering */
pairingheap *iss_ReorderQueue;
@@ -1406,6 +1408,8 @@ typedef struct IndexScanState
* TableSlot slot for holding tuples fetched from the table
* VMBuffer buffer in use for visibility map testing, if any
* PscanLen size of parallel index-only scan descriptor
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ----------------
*/
typedef struct IndexOnlyScanState
@@ -1424,6 +1428,8 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
TupleTableSlot *ioss_TableSlot;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index c1d6f33fc0..754f224817 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -839,6 +839,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1189,6 +1190,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1201,6 +1205,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 8e6594e355..04e871ae83 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -405,6 +405,7 @@ typedef struct IndexScan
List *indexorderbyorig; /* the same in original form */
List *indexorderbyops; /* OIDs of sort ops for ORDER BY exprs */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct scans */
} IndexScan;
/* ----------------
@@ -432,6 +433,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b3d0b4f6fb..9abfdfb6bd 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 374cac157b..56bb16d589 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -201,6 +201,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index c6d575a2f9..4f5c82f49d 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..dcae34e9c0 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,437 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+SELECT DISTINCT four FROM tenk1 ORDER BY four DESC;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, hundred, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) hundred
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+ a | b
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+FETCH FROM c;
+ a | b
+---+---
+ 1 | 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+END;
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+FETCH FROM c;
+ a | b
+---+-------
+ 5 | 10000
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+END;
+DROP TABLE distinct_a;
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+ QUERY PLAN
+--------------------------------------------------------------
+ Index Only Scan using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan mode: true
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 1 | 2
+ 3 | 1 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 1 | 2
+ 1 | 1 | 2
+(2 rows)
+
+END;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Index Only Scan Backward using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan mode: true
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 2 | 2
+ 1 | 2 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 2 | 2
+ 3 | 2 | 2
+(2 rows)
+
+END;
+DROP TABLE distinct_abc;
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ four | ten
+------+-----
+ 0 | 0
+ 1 | 9
+ 2 | 0
+ 3 | 1
+(4 rows)
+
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ four | ten
+------+-----
+ 1 | 9
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ QUERY PLAN
+--------------------------------------
+ Index Scan using tenk1_four on tenk1
+ Skip scan mode: true
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ QUERY PLAN
+---------------------------------------------------
+ Result
+ -> Unique
+ -> Bitmap Heap Scan on tenk1
+ Recheck Cond: (four = 1)
+ -> Bitmap Index Scan on tenk1_four
+ Index Cond: (four = 1)
+(6 rows)
+
+-- check columns order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ four | ten
+------+-----
+ 0 | 0
+ 0 | 2
+ 0 | 4
+ 0 | 6
+ 0 | 8
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Skip scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Skip scan mode: true
+ Index Cond: (four = 0)
+(3 rows)
+
+DROP INDEX tenk1_four_ten;
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ four | ten
+------+-----
+ 0 | 2
+ 2 | 2
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_ten_four on tenk1
+ Skip scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_ten_four on tenk1
+ Skip scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+DROP INDEX tenk1_ten_four;
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+ four | four
+------+------
+ 0 | 0
+ 2 | 2
+(2 rows)
+
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+ four | ?column?
+------+----------
+ 2 | 1
+ 0 | 1
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+-------------------------------------------
+ Index Only Scan using tenk1_four on tenk1
+ Skip scan mode: true
+(2 rows)
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+FETCH FROM c;
+ four
+------
+ 0
+(1 row)
+
+FETCH BACKWARD FROM c;
+ four
+------
+(0 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+END;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index f96bebf410..a3be42a725 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..22222592ee 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,166 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+SELECT DISTINCT four FROM tenk1 ORDER BY four DESC;
+
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, hundred, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) hundred
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+DROP TABLE distinct_a;
+
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+DROP TABLE distinct_abc;
+
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+-- check columns order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+DROP INDEX tenk1_four_ten;
+
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+DROP INDEX tenk1_ten_four;
+
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+END;
--
2.21.0
Thanks for the new patch. I've reviewed the skip scan patch, but haven't taken a close look at the uniquekeys patch yet.
In my previous review I mentioned that queries of the form:
select distinct on(a) a,b from a where a=1;
do not lead to a skip scan with the patch, even though a skip scan would be much faster. The relevant piece of code is in planner.c:
/* Consider index skip scan as well */
if (enable_indexskipscan &&
IsA(path, IndexPath) &&
((IndexPath *) path)->indexinfo->amcanskip &&
root->distinct_pathkeys != NIL)
The root->distinct_pathkeys is already filtered for redundant keys, so column 'a' is not in there anymore. Still, it'd be much faster to use the skip scan here, because a regular scan will read all entries with a=1, even though we're really only interested in the first one. In previous versions this would be fixed by changing the check in planner.c to use root->uniq_distinct_pathkeys instead of root->distinct_pathkeys, but things change a bit now that the patch is rebased on the unique-keys patch. Would it be valid to change this check to root->query_uniquekeys != NIL so that skip scans are considered for this query as well?
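To illustrate the cost difference being discussed, here is a toy Python model (not PostgreSQL code; the in-memory "index" and the tuple counts are assumptions purely for illustration) of what the two strategies read for `select distinct on(a) a,b from a where a=1`:

```python
import bisect

# Toy model: the index on (a, b) is a sorted list of key tuples,
# mirroring the test table with a in 1..5 and b in 1..10000.
index = sorted((a, b) for a in range(1, 6) for b in range(1, 10001))

# Regular index scan: walk every entry satisfying a = 1, keep the first.
lo = bisect.bisect_left(index, (1, 0))
hi = bisect.bisect_right(index, (1, float("inf")))
regular_reads = hi - lo            # touches all 10000 matching tuples

# Skip scan: position on the first a = 1 entry and stop immediately;
# since 'a' is pinned by the qual, one distinct group means one fetch.
first = index[lo]
skip_reads = 1

print(first, regular_reads, skip_reads)   # (1, 1) 10000 1
```

The point is only proportions: one fetch versus one fetch per matching tuple, which is why using uniq_distinct_pathkeys (or query_uniquekeys) instead of the redundancy-filtered distinct_pathkeys matters for this query shape.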
Second is about the use of _bt_skip and _bt_read_closest in nbtsearch.c. I don't think _bt_read_closest is correctly implemented, and I'm not sure if it can be used at all, due to concerns by Tom and Peter about such an approach. I had a similar idea to only partially read items from a page for another use case, for which I submitted a patch last Friday. However, both Tom and Peter find this idea quite scary [1]/messages/by-id/26641.1564778586@sss.pgh.pa.us. You could take a look at my patch on that thread to see the approach taken to correctly partially read a page (well, correct as far as I can see so far...), but perhaps we need to just use the regular _bt_readpage function, which reads everything. That is unfortunate from a performance point of view, since most of the time we're indeed just interested in the first tuple.
E.g. it looks like there are some mix-ups between index offsets and heap TIDs in _bt_read_closest:
/*
* Save the current item and the previous, even if the
* latter does not pass scan key conditions
*/
ItemPointerData tid = prevItup->t_tid;
OffsetNumber prevOffnum = ItemPointerGetOffsetNumber(&tid);
_bt_saveitem(so, itemIndex, prevOffnum, prevItup);
itemIndex++;
_bt_saveitem(so, itemIndex, offnum, itup);
itemIndex++;
The 'prevOffnum' here is the offset number taken from the heap TID, not the offset of the index tuple on the page, so it looks like just a random item is saved. Furthermore, index offsets may change due to insertions and vacuums, so if we release the lock at any point, these offsets are not necessarily valid anymore. On top of that, the patch currently just reads the closest item and then doesn't consider that page at all anymore if the first tuple skipped to turns out to be not visible. Consider the following SQL to illustrate:
create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 10000) ks);
create index on a (a,b);
analyze a;
select distinct on (a) a,b from a order by a,b;
a | b
---+---
1 | 1
2 | 1
3 | 1
4 | 1
5 | 1
(5 rows)
delete from a where a=2 and b=1;
DELETE 1
select distinct on (a) a,b from a order by a,b;
a | b
---+-----
1 | 1
 2 | 249 ->> this should be b=2, because we deleted a=2 && b=1. However, it doesn't consider any tuples from that page anymore and gives us the first tuple from the next page.
3 | 1
4 | 1
5 | 1
(5 rows)
-Floris
On Mon, Aug 5, 2019 at 12:05 PM Floris Van Nee <florisvannee@optiver.com> wrote:
The root->distinct_pathkeys is already filtered for redundant keys, so column
'a' is not in there anymore. Still, it'd be much faster to use the skip scan
here, because a regular scan will scan all entries with a=1, even though
we're really only interested in the first one. In previous versions, this
would be fixed by changing the check in planner.c to use
root->uniq_distinct_pathkeys instead of root->distinct_pathkeys, but things
change a bit now that the patch is rebased on the unique-keys patch. Would it
be valid to change this check to root->query_uniquekeys != NIL to consider
skip scans also for this query?
[including commentary from Jesper]
On Mon, Aug 5, 2019 at 6:55 PM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
Yes, the check should be for that. However, the query in question
doesn't have any query_pathkeys, and hence no query_uniquekeys in
standard_qp_callback(), so it isn't supported.
Your scenario is covered by one of the test cases for when the
functionality is supported. But I think that is outside the scope of
the current patch.
However, currently, the patch just reads the closest and then doesn't
consider this page at all anymore, if the first tuple skipped to turns out to
be not visible. Consider the following sql to illustrate:
For the record, the purpose of `_bt_read_closest` is not so much to reduce
the amount of data we read from a page, but rather to correctly handle the
situations we were discussing before with reading forward/backward in cursors,
since in some of those cases we need to remember previous values for stepping
to the next. I limited the number of items fetched in this function just
because I was misled by having a check for dead tuples in `_bt_skip`. Of course
we can modify it to read a whole page and leave the visibility check for the
code after `index_getnext_tid` (although if we know that all tuples on this
page are visible, I guess that's not strictly necessary, but we still get an
improvement from the skipping itself).
Yes, the check should be for that. However, the query in question
doesn't have any query_pathkeys, and hence query_uniquekeys in
standard_qp_callback(), so therefore it isn't supported
Your scenario is covered by one of the test cases in case the
functionality is supported. But, I think that is outside the scope of
the current patch.
Ah alright, thanks. That makes it clear why it doesn't work.
From a user point of view I think it's rather strange that
SELECT DISTINCT ON (a) a,b FROM a WHERE a BETWEEN 2 AND 2
would give a fast skip scan, even though the more likely query that someone would write
SELECT DISTINCT ON (a) a,b FROM a WHERE a=2
would not.
It is something we could leave up to the next patch though.
Something else I just noticed, which I'm writing here for awareness; I don't think it's pressing at the moment and can be left to another patch. When there are multiple indexes on a table, the planner gets confused and doesn't select an index-only skip scan even though it could. I'm guessing it just takes the first available index based on the DISTINCT clause and then doesn't look further, e.g.
With an index on (a,b) and (a,c,b):
postgres=# explain select distinct on (a) a,c,b FROM a;
QUERY PLAN
--------------------------------------------------------------------
Index Scan using a_a_b_idx on a (cost=0.29..1.45 rows=5 width=12)
Skip scan mode: true
(2 rows)
-> This could be an index-only scan with the (a,c,b) index.
For the records, the purpose of `_bt_read_closest` is not so much to reduce
amount of data we read from a page, but more to correctly handle those
situations we were discussing before with reading forward/backward in cursors,
since for that in some cases we need to remember previous values for stepping
to the next. I've limited number of items, fetched in this function just
because I was misled by having a check for dead tuples in `_bt_skip`. Of course
we can modify it to read a whole page and leave visibility check for the code
after `index_getnext_tid` (although in case if we know that all tuples on this
page are visilbe I guess it's not strictly necessary, but we still get
improvement from skipping itself).
I understand and I agree; the primary reason we chose this function was to make it work correctly. I don't think using the optimization of partially reading a page would be something for this patch. My point was, however, that if this optimization were allowed in a future patch, it would have great performance benefits.
To fix the current patch, we'd indeed need to read the full page. It'd be good to take a close look at the implementation of this function then, because messing around with the previous/next item is also not trivial. I think the current implementation also has a problem when the item that is skipped to is the first item on the page. E.g. (this depends on page size):
postgres=# drop table if exists b; create table b as select a,b from generate_series(1,5) a, generate_series(1,366) b; create index on b (a,b); analyze b;
DROP TABLE
SELECT 1830
CREATE INDEX
ANALYZE
postgres=# select distinct on(a) a,b from b;
a | b
---+---
1 | 1
2 | 2 <-- (2,1) is the first item on the page and doesn't get selected by the read_closest function; it returns the second item on the page, which is (2,2)
3 | 2
4 | 2
5 | 2
(5 rows)
-Floris
On Mon, Aug 5, 2019 at 10:38 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Of course we can modify it to read a whole page and leave visibility check
for the code after `index_getnext_tid` (although in case if we know that all
tuples on this page are visilbe I guess it's not strictly necessary, but we
still get improvement from skipping itself).
Sorry for the long delay. Here is more or less what I had in mind. After
changing read_closest to read the whole page, I couldn't resist just merging it
into readpage itself, since it's basically the same. It could raise questions
about performance and how intrusive this patch is, but I hope it's not that
much of a problem (in the worst case we can split it back out). I've also added
a few tests for the issue you mentioned. Thanks again, I appreciate how much
effort you put into reviewing!
Attachments:
v24-0001-Index-skip-scan.patch (application/octet-stream)
From dde70f1593cc6b72916855674cfcb3604b1e4524 Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Wed, 3 Jul 2019 16:25:20 +0200
Subject: [PATCH v24] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan and IndexScan. To make it suitable both for
situations with a small number of distinct values and with a significant
number of distinct values, the following approach is taken: instead of
searching from the root for every value, we search first on the current
page, and if the value is not found there, continue searching from the
root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
Author: Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Floris Van Nee
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 15 +
doc/src/sgml/indexam.sgml | 63 +++
doc/src/sgml/indices.sgml | 24 +
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 18 +
src/backend/access/nbtree/nbtree.c | 13 +
src/backend/access/nbtree/nbtsearch.c | 469 ++++++++++++++++-
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 29 ++
src/backend/executor/nodeIndexonlyscan.c | 46 +-
src/backend/executor/nodeIndexscan.c | 43 +-
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 3 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/path/pathkeys.c | 84 ++-
src/backend/optimizer/plan/createplan.c | 20 +-
src/backend/optimizer/plan/planagg.c | 1 +
src/backend/optimizer/plan/planner.c | 91 +++-
src/backend/optimizer/util/pathnode.c | 40 ++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 8 +
src/include/access/genam.h | 2 +
src/include/access/nbtree.h | 7 +
src/include/nodes/execnodes.h | 6 +
src/include/nodes/pathnodes.h | 10 +
src/include/nodes/plannodes.h | 2 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/include/optimizer/paths.h | 4 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 482 ++++++++++++++++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 178 +++++++
41 files changed, 1659 insertions(+), 33 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index cc1670934f..ab9f0a7177 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 89284dc5c0..3edd12dd27 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4413,6 +4413,21 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). The default is
+ <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index dd54c68802..73b1b4fcf7 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -144,6 +144,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -687,6 +688,68 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan,
+ ScanDirection direction,
+ ScanDirection indexdir,
+ bool scanstart,
+ int prefix);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan. The arguments are:
+
+ <variablelist>
+ <varlistentry>
+ <term><parameter>scan</parameter></term>
+ <listitem>
+ <para>
+ Index scan information
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>direction</parameter></term>
+ <listitem>
+ <para>
+ The direction in which data is advancing.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>indexdir</parameter></term>
+ <listitem>
+ <para>
+ The index direction, in which data must be read.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>scanstart</parameter></term>
+ <listitem>
+ <para>
+ Whether or not this is the start of the scan.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>prefix</parameter></term>
+ <listitem>
+ <para>
+ Distinct prefix size.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 95c0a1926c..567141046f 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1235,6 +1235,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When an index scan is used to retrieve the distinct values of a column,
+ it can be inefficient, since it requires scanning all the equal values
+ of a key. In such cases the planner will consider applying the index
+ skip scan approach, which is based on the idea of
+ <firstterm>Loose index scan</firstterm>. Rather than scanning all equal
+ values of a key, as soon as a new value is found, it will search for a
+ larger value on the same index page, and if none is found there, restart
+ the search from the root looking for a larger value. This is much faster
+ when the index has many equal keys.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ae7b729edd..233ea9e5ec 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -109,6 +109,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cf9699ad18..9817f34c34 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0cc87911d6..38072ad24b 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 5cc30dac42..019e330cff 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -82,6 +82,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 28edd4aca7..ae7a882571 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -730,6 +731,23 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool scanstart, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction,
+ indexdir, scanstart, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..46471598d1 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -131,6 +131,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -380,6 +381,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -447,6 +450,16 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool start, int prefix)
+{
+ return _bt_skip(scan, direction, indexdir, start, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7f77ed24c5..a47465dfd1 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -28,6 +28,9 @@ static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
+static bool _bt_readpage_internal(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum, ScanDirection indexdir,
+ bool keepPrev);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
@@ -37,7 +40,10 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
+static inline void _bt_update_skip_scankeys(IndexScanDesc scan,
+ Relation indexRel);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir);
/*
* _bt_drop_lock_and_maybe_pin()
@@ -1373,6 +1379,315 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple.
+ *
+ * The current position is set so that a subsequent call to _bt_next will
+ * fetch the first tuple that differs in the leading 'prefix' keys.
+ *
+ * There are four different kinds of skipping (depending on dir and
+ * indexdir) that are important to distinguish, especially in the presence
+ * of an index condition:
+ *
+ * * Advancing forward and reading forward
+ * simple scan
+ *
+ * * Advancing forward and reading backward
+ * scan inside a cursor fetching backward, when skipping is necessary
+ * right from the start
+ *
+ * * Advancing backward and reading forward
+ * scan with order by desc inside a cursor fetching forward, when
+ * skipping is necessary right from the start
+ *
+ * * Advancing backward and reading backward
+ * simple scan with order by desc
+ *
+ * This function in conjunction with _bt_readpage_internal handles them all.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool scanstart, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Relation indexRel = scan->indexRelation;
+ OffsetNumber startOffset = ItemPointerGetOffsetNumber(&scan->xs_itup->t_tid);
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ so->skipScanKey->keysz = prefix;
+ so->skipScanKey->scantid = NULL;
+ }
+ else
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos) &&
+ _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {
+ bool keyFound = false;
+
+ LockBuffer(so->currPos.buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, so->currPos.buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(so->currPos.buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_readpage_internal(scan, dir, offnum, indexdir, true);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found scan key within the current page, so let's scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+ * number
+ */
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ /*
+ * Simplest case, advance forward and read also forward. At this moment we
+ * are at the next distinct key at the beginning of the series. Go back one
+ * step and let _bt_readpage_internal figure out about index condition.
+ */
+ if (ScanDirectionIsForward(dir) && ScanDirectionIsForward(indexdir))
+ offnum = OffsetNumberPrev(offnum);
+
+ /*
+ * Advance backward but read forward. At this moment we are at the next
+ * distinct key at the beginning of the series. In case the scan just
+ * started, we can read forward without doing anything else. Otherwise find
+ * the previous distinct key and the beginning of its series and read forward
+ * from there. To do so, go back one step, perform a binary search to find
+ * the first item in the series and let _bt_readpage_internal do everything
+ * else.
+ */
+ else if (ScanDirectionIsBackward(dir) && ScanDirectionIsForward(indexdir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ if (!scanstart)
+ {
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ /* One step back to find a previous value */
+ _bt_readpage_internal(scan, dir, offnum, dir, true);
+
+ if (_bt_next(scan, dir))
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ /* And now find the last item from the sequence for the current
+ * value, with the intention to do OffsetNumberNext. As a result we
+ * end up on the first element of the sequence. */
+ if (_bt_scankey_within_page(scan, so->skipScanKey,
+ so->currPos.buf, dir))
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ else
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ }
+ }
+ else
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ }
+
+ /*
+ * Advance forward but read backward. At this moment we are at the next
+ * distinct key at the beginning of the series. In case the scan just
+ * started, we can go one step back and read forward without doing anything
+ * else. Otherwise find the next distinct key and the beginning of its
+ * series, go one step back and read backward from there.
+ *
+ * An interesting situation can happen if one of the distinct keys does not
+ * pass the corresponding index condition at all. In this case reading
+ * backward can lead to a previous distinct key being found, creating a
+ * loop. To avoid that, check the value to be returned, and jump one more
+ * time if it's the same as at the beginning.
+ */
+ else if (ScanDirectionIsForward(dir) && ScanDirectionIsBackward(indexdir))
+ {
+ if (scanstart)
+ offnum = OffsetNumberPrev(offnum);
+ else
+ {
+ OffsetNumber nextOffset = startOffset;
+
+ while(nextOffset == startOffset)
+ {
+ /*
+ * Find a next index tuple to update scan key. It could be at
+ * the end, so check for max offset
+ */
+ OffsetNumber curOffnum = offnum;
+ Page page = BufferGetPage(so->currPos.buf);
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ ItemId itemid = PageGetItemId(page, Min(offnum, maxoff));
+
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ scan->xs_itup = (IndexTuple) PageGetItem(page, itemid);
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /*
+ * The jump to the next key returned the same offset, which means
+ * we are at the end and need to return
+ */
+ if (offnum == curOffnum)
+ {
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ BTScanPosUnpinIfPinned(so->currPos);
+ BTScanPosInvalidate(so->currPos);
+
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+
+ /* Check if _bt_readpage_internal returns already found item */
+ if (_bt_readpage_internal(scan, dir, offnum, indexdir, true))
+ {
+ IndexTuple itup;
+
+ currItem = &so->currPos.items[so->currPos.lastItem];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
+ }
+ else
+ {
+ elog(ERROR, "could not read closest index tuples: %d", offnum);
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+
+ /*
+ * If nextOffset is the same as before, it means we are in a
+ * loop; return offnum to the original position and jump
+ * further
+ */
+ if (nextOffset == startOffset)
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage_internal(scan, dir, offnum, indexdir, true))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
@@ -1394,12 +1709,33 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
*/
static bool
_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
+{
+ return _bt_readpage_internal(scan, dir, offnum,
+ NoMovementScanDirection, false);
+}
+
+/*
+ * _bt_readpage_internal() -- worker function for _bt_readpage
+ *
+ * Besides the regular readpage functionality, this function allows saving
+ * the item just before those that we would normally save in _bt_readpage.
+ * This is used for _bt_skip.
+ *
+ * For that, the caller needs to set keepPrev to true. Since the definition of
+ * "previous" in the cursor case also depends on the index direction, one needs
+ * to provide it as an argument as well.
+ */
+static bool
+_bt_readpage_internal(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum, ScanDirection indexdir, bool keepPrev)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Page page;
BTPageOpaque opaque;
OffsetNumber minoff;
OffsetNumber maxoff;
+ IndexTuple prevItup = NULL;
+ OffsetNumber prevOffnum = InvalidOffsetNumber;
int itemIndex;
bool continuescan;
int indnatts;
@@ -1456,7 +1792,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
*/
Assert(BTScanPosIsPinned(so->currPos));
- if (ScanDirectionIsForward(dir))
+ if (ScanDirectionIsForward(keepPrev ? indexdir : dir))
{
/* load items[] in ascending order */
itemIndex = 0;
@@ -1482,14 +1818,42 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
- /* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (keepPrev)
+ {
+ /* tuple passes all scan key conditions, so remember it */
+ if (ScanDirectionIsBackward(dir) || itemIndex >= 2)
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else if (prevItup != NULL)
+ {
+ /*
+ * Save the current item and the previous, even if the
+ * latter does not pass scan key conditions
+ */
+ _bt_saveitem(so, itemIndex, prevOffnum, prevItup);
+ itemIndex++;
+
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ }
+ else
+ {
+ /* tuple passes all scan key conditions, so remember it */
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
break;
+ /* Save previous tuple and offset */
+ prevItup = itup;
+ prevOffnum = offnum;
+
offnum = OffsetNumberNext(offnum);
}
@@ -1520,7 +1884,10 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex <= MaxIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
- so->currPos.itemIndex = 0;
+ if (keepPrev)
+ so->currPos.itemIndex = ScanDirectionIsForward(dir) ? 0 : 1;
+ else
+ so->currPos.itemIndex = 0;
}
else
{
@@ -1566,9 +1933,34 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
&continuescan);
if (passes_quals && tuple_alive)
{
- /* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (keepPrev)
+ {
+ /* tuple passes all scan key conditions, so remember it */
+ if (ScanDirectionIsForward(dir) ||
+ MaxIndexTuplesPerPage - itemIndex >= 2)
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else if (prevItup != NULL)
+ {
+ /*
+ * Save the current item and the previous, even if the
+ * latter does not pass scan key conditions
+ */
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, prevOffnum, prevItup);
+
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ }
+ else
+ {
+ /* tuple passes all scan key conditions, so remember it */
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
}
if (!continuescan)
{
@@ -1577,13 +1969,21 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
break;
}
+ /* Save previous tuple and offset */
+ prevItup = itup;
+ prevOffnum = offnum;
+
offnum = OffsetNumberPrev(offnum);
}
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ if (keepPrev)
+ so->currPos.itemIndex = MaxIndexTuplesPerPage -
+ (ScanDirectionIsForward(dir) ? 2 : 1);
+ else
+ so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -2244,3 +2644,52 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
so->numKilled = 0; /* just paranoia */
so->markItemIndex = -1; /* ditto */
}
+
+/*
+ * _bt_update_skip_scankeys() -- set up new values for the existing scankeys
+ * based on the current index tuple
+ */
+static inline void
+_bt_update_skip_scankeys(IndexScanDesc scan, Relation indexRel)
+{
+ TupleDesc itupdesc;
+ int indnkeyatts, i;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey scankeys = so->skipScanKey->scankeys;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+}
+
+/*
+ * _bt_scankey_within_page() -- check if the provided scankey could be found
+ * within the page held in the specified buffer.
+ */
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir)
+{
+ OffsetNumber low, high, compare_offset;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ return _bt_compare(scan->indexRelation,
+ key, page, compare_offset) > compare_value;
+}
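For readers skimming the nbtree changes: the core idea behind _bt_skip and the amskip callback is to jump, via a fresh descent, to the first tuple whose key prefix differs from the current one, instead of stepping over every duplicate. Outside the B-Tree machinery, the same effect can be sketched on a plain sorted array with a binary search (a standalone illustration only, not patch code; all names are made up):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Find the first position after 'cur' whose value differs from
 * vals[cur], using binary search. This is the array analogue of a
 * B-Tree skip: O(log n) per distinct value instead of O(n) overall.
 */
static size_t
skip_to_next_distinct(const int *vals, size_t n, size_t cur)
{
    int key = vals[cur];
    size_t lo = cur + 1;
    size_t hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (vals[mid] <= key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;                  /* returns n when no larger value exists */
}

/* Collect the distinct values of a sorted array by repeated skipping. */
static size_t
distinct_values(const int *vals, size_t n, int *out)
{
    size_t count = 0;
    size_t pos = 0;

    while (pos < n)
    {
        out[count++] = vals[pos];
        pos = skip_to_next_distinct(vals, n, pos);
    }
    return count;
}
```

On the example at the top of this mail (ten million rows, three distinct values of b), this is the difference between visiting every index entry and doing a handful of descents, which is where the shared hit=44248 versus shared hit=13 gap comes from.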
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 45472db147..dc151ecf09 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 62fb3434a3..ad500de12b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -130,6 +130,7 @@ static void ExplainDummyGroup(const char *objtype, const char *labelname,
static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
static void ExplainJSONLineEnding(ExplainState *es);
static void ExplainYAMLLineStarting(ExplainState *es);
+static void ExplainIndexSkipScanKeys(ExplainState *es, int skipPrefixSize);
static void escape_yaml(StringInfo buf, const char *str);
@@ -1041,6 +1042,22 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
return planstate_tree_walker(planstate, ExplainPreScanNode, rels_used);
}
+/*
+ * ExplainIndexSkipScanKeys -
+ * Append information about index skip scan to es->str.
+ *
+ * Can be used to print the skip prefix size.
+ */
+static void
+ExplainIndexSkipScanKeys(ExplainState *es, int skipPrefixSize)
+{
+ if (skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL, skipPrefixSize, es);
+ }
+}
+
/*
* ExplainNode -
* Appends a description of a plan tree to es->str
@@ -1363,6 +1380,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexScan *indexscan = (IndexScan *) plan;
+ ExplainIndexSkipScanKeys(es, indexscan->indexskipprefixsize);
+
ExplainIndexScanDetails(indexscan->indexid,
indexscan->indexorderdir,
es);
@@ -1373,6 +1392,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ ExplainIndexSkipScanKeys(es, indexonlyscan->indexskipprefixsize);
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1582,6 +1603,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
switch (nodeTag(plan))
{
case T_IndexScan:
+ if (((IndexScan *) plan)->indexskipprefixsize > 0)
+ {
+ ExplainPropertyBool("Skip scan mode", true, es);
+ }
show_scan_qual(((IndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
if (((IndexScan *) plan)->indexqualorig)
@@ -1595,6 +1620,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->indexskipprefixsize > 0)
+ {
+ ExplainPropertyBool("Skip scan mode", true, es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 652a9afc75..21f169f5ea 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -65,6 +65,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
IndexScanDesc scandesc;
TupleTableSlot *slot;
ItemPointer tid;
+ IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) node->ss.ps.plan;
/*
* extract necessary information from index scan node
@@ -72,7 +73,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexonlyscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -115,6 +116,45 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_SkipPrefixSize > 0)
+ {
+ bool startscan = false;
+
+ /*
+ * When fetching from a cursor in the direction opposite to the overall
+ * scan direction, the result must be what normal fetching would have
+ * returned, but in reversed order. In other words, return the last or
+ * the first scanned tuple in a DISTINCT set, depending on the cursor
+ * direction. Skip to that tuple before returning the first tuple.
+ */
+ if (direction * indexonlyscan->indexorderdir < 0 &&
+ !node->ioss_FirstTupleEmitted)
+ {
+ if (index_getnext_tid(scandesc, direction))
+ {
+ node->ioss_FirstTupleEmitted = true;
+ startscan = true;
+ }
+ }
+
+ if (node->ioss_FirstTupleEmitted &&
+ !index_skip(scandesc, direction, indexonlyscan->indexorderdir,
+ startscan, node->ioss_SkipPrefixSize))
+ {
+ /*
+ * Reached the end of the index. At this point currPos is invalidated,
+ * and we need to reset ioss_FirstTupleEmitted, since otherwise after
+ * going backwards, reaching the end of the index, and going forward
+ * again we would apply the skip again, which would be incorrect and
+ * lead to an extra skipped item.
+ */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -250,6 +290,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -500,6 +542,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index ac7aa81f67..6e649930c2 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -85,6 +85,7 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ IndexScan *indexscan = (IndexScan *) node->ss.ps.plan;
/*
* extract necessary information from index scan node
@@ -92,7 +93,7 @@ IndexNext(IndexScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -116,6 +117,7 @@ IndexNext(IndexScanState *node)
node->iss_NumOrderByKeys);
node->iss_ScanDesc = scandesc;
+ node->iss_ScanDesc->xs_want_itup = true;
/*
* If no run-time keys to calculate or they are ready, go ahead and
@@ -127,6 +129,42 @@ IndexNext(IndexScanState *node)
node->iss_OrderByKeys, node->iss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_SkipPrefixSize > 0)
+ {
+ bool startscan = false;
+
+ /*
+ * If advancing direction is different from index direction, we must
+ * skip right away, but _bt_skip requires a starting point.
+ */
+ if (direction * indexscan->indexorderdir < 0 &&
+ !node->ioss_FirstTupleEmitted)
+ {
+ if (index_getnext_slot(scandesc, direction, slot))
+ {
+ node->ioss_FirstTupleEmitted = true;
+ startscan = true;
+ }
+ }
+
+ if (node->ioss_FirstTupleEmitted &&
+ !index_skip(scandesc, direction, indexscan->indexorderdir,
+ startscan, node->ioss_SkipPrefixSize))
+ {
+ /*
+ * Reached the end of the index. At this point currPos is invalidated,
+ * and we need to reset ioss_FirstTupleEmitted, since otherwise after
+ * going backwards, reaching the end of the index, and going forward
+ * again we would apply the skip again, which would be incorrect and
+ * lead to an extra skipped item.
+ */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* ok, now that we have what we need, fetch the next tuple.
*/
@@ -149,6 +187,7 @@ IndexNext(IndexScanState *node)
}
}
+ node->ioss_FirstTupleEmitted = true;
return slot;
}
@@ -906,6 +945,8 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexScan;
+ indexstate->ioss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index a2617c7cfd..20495c9e52 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -490,6 +490,7 @@ _copyIndexScan(const IndexScan *from)
COPY_NODE_FIELD(indexorderbyorig);
COPY_NODE_FIELD(indexorderbyops);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
@@ -515,6 +516,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e6ce8e2110..2ff9625533 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -559,6 +559,7 @@ _outIndexScan(StringInfo str, const IndexScan *node)
WRITE_NODE_FIELD(indexorderbyorig);
WRITE_NODE_FIELD(indexorderbyops);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
@@ -573,6 +574,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
@@ -2213,6 +2215,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
WRITE_NODE_FIELD(group_pathkeys);
WRITE_NODE_FIELD(window_pathkeys);
WRITE_NODE_FIELD(distinct_pathkeys);
+ WRITE_NODE_FIELD(uniq_distinct_pathkeys);
WRITE_NODE_FIELD(sort_pathkeys);
WRITE_NODE_FIELD(processed_tlist);
WRITE_NODE_FIELD(minmax_aggs);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 764e3bb90c..0fc3c5ea68 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1787,6 +1787,7 @@ _readIndexScan(void)
READ_NODE_FIELD(indexorderbyorig);
READ_NODE_FIELD(indexorderbyops);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
@@ -1806,6 +1807,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c5f6593485..194e258dc1 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 2f4fea241a..70c1df47a4 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
#include "utils/lsyscache.h"
+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
RelOptInfo *partrel,
@@ -96,6 +97,30 @@ make_canonical_pathkey(PlannerInfo *root,
return pk;
}
+/*
+ * pathkey_is_unique
+ * Subroutine of pathkey_is_redundant that is responsible for the case where the
+ * new pathkey's equivalence class is the same as that of any existing
+ * member of the pathkey list.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+ EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+ ListCell *lc;
+
+ /* If the same EC is already used in the list, the new pathkey is not unique */
+ foreach(lc, pathkeys)
+ {
+ PathKey *old_pathkey = (PathKey *) lfirst(lc);
+
+ if (new_ec == old_pathkey->pk_eclass)
+ return true;
+ }
+
+ return false;
+}
+
/*
* pathkey_is_redundant
* Is a pathkey redundant with one already in the given list?
@@ -135,22 +160,12 @@ static bool
pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys)
{
EquivalenceClass *new_ec = new_pathkey->pk_eclass;
- ListCell *lc;
/* Check for EC containing a constant --- unconditionally redundant */
if (EC_MUST_BE_REDUNDANT(new_ec))
return true;
- /* If same EC already used in list, then redundant */
- foreach(lc, pathkeys)
- {
- PathKey *old_pathkey = (PathKey *) lfirst(lc);
-
- if (new_ec == old_pathkey->pk_eclass)
- return true;
- }
-
- return false;
+ return pathkey_is_unique(new_pathkey, pathkeys);
}
/*
@@ -1098,6 +1113,53 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
return pathkeys;
}
+/*
+ * make_pathkeys_for_distinctclauses
+ * Generate a pathkeys list for distinct clauses that represents the sort
+ * order specified by a list of SortGroupClauses. Similar to
+ * make_pathkeys_for_sortclauses, but allows the caller to specify whether
+ * to check for full redundancy or just uniqueness.
+ */
+List *
+make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *distinctclauses,
+ List *tlist, bool checkRedundant)
+{
+ List *pathkeys = NIL;
+ ListCell *l;
+
+ foreach(l, distinctclauses)
+ {
+ SortGroupClause *sortcl = (SortGroupClause *) lfirst(l);
+ Expr *sortkey;
+ PathKey *pathkey;
+
+ sortkey = (Expr *) get_sortgroupclause_expr(sortcl, tlist);
+ Assert(OidIsValid(sortcl->sortop));
+ pathkey = make_pathkey_from_sortop(root,
+ sortkey,
+ root->nullable_baserels,
+ sortcl->sortop,
+ sortcl->nulls_first,
+ sortcl->tleSortGroupRef,
+ true);
+
+ /* Canonical form eliminates redundant ordering keys */
+ if (checkRedundant)
+ {
+ if (!pathkey_is_redundant(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ else
+ {
+ if (!pathkey_is_unique(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ }
+ return pathkeys;
+}
+
+
/****************************************************************************
* PATHKEYS AND MERGECLAUSES
****************************************************************************/
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 0c036209f0..6e54446b29 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -175,12 +175,14 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
Oid indexid, List *indexqual, List *indexqualorig,
List *indexorderby, List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2905,7 +2907,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -2916,7 +2919,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexorderbys,
indexorderbys,
indexorderbyops,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
copy_generic_path_info(&scan_plan->plan, &best_path->path);
@@ -5179,7 +5183,8 @@ make_indexscan(List *qptlist,
List *indexorderby,
List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexScan *node = makeNode(IndexScan);
Plan *plan = &node->scan.plan;
@@ -5196,6 +5201,7 @@ make_indexscan(List *qptlist,
node->indexorderbyorig = indexorderbyorig;
node->indexorderbyops = indexorderbyops;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
@@ -5208,7 +5214,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5223,6 +5230,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 9381939c82..ed52139839 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -505,6 +505,7 @@ minmax_qp_callback(PlannerInfo *root, void *extra)
root->group_pathkeys = NIL;
root->window_pathkeys = NIL;
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 17c5f086fb..688dcca4f1 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3622,12 +3622,21 @@ standard_qp_callback(PlannerInfo *root, void *extra)
if (parse->distinctClause &&
grouping_is_sortable(parse->distinctClause))
+ {
+ root->uniq_distinct_pathkeys =
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, false);
root->distinct_pathkeys =
- make_pathkeys_for_sortclauses(root,
- parse->distinctClause,
- tlist);
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, true);
+ }
else
+ {
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
+ }
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
@@ -4814,6 +4823,82 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Consider index skip scan as well */
+ if (enable_indexskipscan &&
+ IsA(path, IndexPath) &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys != NIL)
+ {
+ ListCell *lc;
+ IndexOptInfo *index = NULL;
+ bool different_columns_order = false,
+ not_empty_qual = false;
+ int i = 0;
+ int distinctPrefixKeys;
+
+ Assert(path->pathtype == T_IndexOnlyScan ||
+ path->pathtype == T_IndexScan);
+
+ index = ((IndexPath *) path)->indexinfo;
+ distinctPrefixKeys = list_length(root->uniq_distinct_pathkeys);
+
+ /*
+ * Normally we can think about distinctPrefixKeys as just a
+ * number of distinct keys. But if lets say we have a
+ * distinct key a, and the index contains b, a in exactly
+ * this order. In such situation we need to use position of
+ * a in the index as distinctPrefixKeys, otherwise skip
+ * will happen only by the first column.
+ */
+ foreach(lc, root->uniq_distinct_pathkeys)
+ {
+ PathKey *pathKey = lfirst_node(PathKey, lc);
+ EquivalenceMember *em =
+ lfirst_node(EquivalenceMember,
+ list_head(pathKey->pk_eclass->ec_members));
+ Var *var = (Var *) em->em_expr;
+
+ Assert(i < index->ncolumns);
+
+ for (i = 0; i < index->ncolumns; i++)
+ {
+ if (index->indexkeys[i] == var->varattno)
+ {
+ distinctPrefixKeys = Max(i + 1, distinctPrefixKeys);
+ break;
+ }
+ }
+ }
+
+ /*
+ * XXX: In the index scan case, qual evaluation happens
+ * after ExecScanFetch, which means skip results could be
+ * filtered out. Consider the following query:
+ *
+ * select distinct (a, b) a, b, c from t where c < 100;
+ *
+ * Skip scan returns one tuple for each distinct set of (a, b)
+ * with an arbitrary one of the c values, so if the chosen c
+ * does not match the qual while some other c does, we miss
+ * that tuple.
+ */
+ if (path->pathtype == T_IndexScan &&
+ parse->jointree != NULL &&
+ parse->jointree->quals != NULL &&
+ list_length((List*) parse->jointree->quals) != 0)
+ not_empty_qual = true;
+
+ if (!different_columns_order && !not_empty_qual)
+ {
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ distinctPrefixKeys,
+ numDistinctRows));
+ }
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 34acb732ee..1de6ae8dcc 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2904,6 +2904,46 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *basepath,
+ int distinctPrefixKeys,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(distinctPrefixKeys > 0);
+ /*Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));*/
+ pathnode->indexskipprefix = distinctPrefixKeys;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = basepath->startup_cost;
+ pathnode->path.total_cost = basepath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
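The costing in create_skipscan_unique_path above is deliberately simple: each expected distinct group is charged one "find the first key" descent. As a standalone sanity check (illustrative names, not planner code), the arithmetic lines up with the head example, where a startup cost of 0.43 and three groups yields roughly the 0.43..1.30 estimate shown:

```c
#include <assert.h>
#include <math.h>

/* Illustrative stand-ins for the planner's cost and row fields. */
typedef struct SketchPath
{
    double startup_cost;
    double total_cost;
    double rows;
} SketchPath;

/*
 * Mirror the costing used by create_skipscan_unique_path: skipping to
 * each distinct value is charged as one more descent to the first key,
 * so the total is the base startup cost times the expected group count.
 */
static SketchPath
skipscan_cost(double base_startup_cost, double num_groups)
{
    SketchPath p;

    p.startup_cost = base_startup_cost;
    p.total_cost = base_startup_cost * num_groups;
    p.rows = num_groups;
    return p;
}
```

Whether charging a full root-to-leaf descent per group is the right model (as opposed to something sensitive to tree height or group clustering) is one of the open questions for this patch.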
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index cf1761401d..34fbc27716 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -272,6 +272,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 90ffd89339..9e5b74b6de 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -910,6 +910,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0fc23e3a61..88f9890780 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6e3db06eed..f84791e358 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -130,6 +130,13 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir,
+ ScanDirection indexdir,
+ bool start,
+ int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -225,6 +232,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 8c053be2ca..e5ec5b07c8 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -173,6 +173,8 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool start, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 52eafe6b00..6d441a4696 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -662,6 +662,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTScanInsert skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -776,6 +779,8 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool start, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -800,6 +805,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool start, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f42189d2bf..ecd40aad7f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1377,6 +1377,8 @@ typedef struct IndexScanState
ExprContext *iss_RuntimeContext;
Relation iss_RelationDesc;
struct IndexScanDescData *iss_ScanDesc;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
/* These are needed for re-checking ORDER BY expr ordering */
pairingheap *iss_ReorderQueue;
@@ -1406,6 +1408,8 @@ typedef struct IndexScanState
* TableSlot slot for holding tuples fetched from the table
* VMBuffer buffer in use for visibility map testing, if any
* PscanLen size of parallel index-only scan descriptor
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ----------------
*/
typedef struct IndexOnlyScanState
@@ -1424,6 +1428,8 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
TupleTableSlot *ioss_TableSlot;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 23a06d718e..ff11c17cca 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -298,6 +298,11 @@ struct PlannerInfo
List *group_pathkeys; /* groupClause pathkeys, if any */
List *window_pathkeys; /* pathkeys of bottom window, if any */
List *distinct_pathkeys; /* distinctClause pathkeys, if any */
+ List *uniq_distinct_pathkeys; /* unique, but potentially redundant
+ distinctClause pathkeys, if any.
+ Used for index skip scan, since
redundant distinctClauses must also
be considered */
List *sort_pathkeys; /* sortClause pathkeys, if any */
List *part_schemes; /* Canonicalised partition schemes used in the
@@ -833,6 +838,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1169,6 +1175,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1181,6 +1190,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 8e6594e355..04e871ae83 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -405,6 +405,7 @@ typedef struct IndexScan
List *indexorderbyorig; /* the same in original form */
List *indexorderbyops; /* OIDs of sort ops for ORDER BY exprs */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct scans */
} IndexScan;
/* ----------------
@@ -432,6 +433,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b3d0b4f6fb..9abfdfb6bd 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index a12af54971..7edcf4e689 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -200,6 +200,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137d1d..a782d12a50 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -209,6 +209,10 @@ extern List *build_join_pathkeys(PlannerInfo *root,
extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
List *sortclauses,
List *tlist);
+extern List *make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist,
+ bool checkRedundant);
extern void initialize_mergeclause_eclasses(PlannerInfo *root,
RestrictInfo *restrictinfo);
extern void update_mergeclause_eclasses(PlannerInfo *root,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index c6d575a2f9..4f5c82f49d 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..b3a438e0a9 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,485 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+SELECT DISTINCT four FROM tenk1 ORDER BY four DESC;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, hundred, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) hundred
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+ a | b
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+FETCH FROM c;
+ a | b
+---+---
+ 1 | 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+END;
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+FETCH FROM c;
+ a | b
+---+-------
+ 5 | 10000
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+END;
+DROP TABLE distinct_a;
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+ QUERY PLAN
+--------------------------------------------------------------
+ Index Only Scan using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan mode: true
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 1 | 2
+ 3 | 1 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 1 | 2
+ 1 | 1 | 2
+(2 rows)
+
+END;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Index Only Scan Backward using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan mode: true
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 2 | 2
+ 1 | 2 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 2 | 2
+ 3 | 2 | 2
+(2 rows)
+
+END;
+DROP TABLE distinct_abc;
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ four | ten
+------+-----
+ 0 | 0
+ 1 | 9
+ 2 | 0
+ 3 | 1
+(4 rows)
+
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ four | ten
+------+-----
+ 1 | 9
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ QUERY PLAN
+--------------------------------------
+ Index Scan using tenk1_four on tenk1
+ Skip scan mode: true
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ QUERY PLAN
+---------------------------------------------------
+ Result
+ -> Unique
+ -> Bitmap Heap Scan on tenk1
+ Recheck Cond: (four = 1)
+ -> Bitmap Index Scan on tenk1_four
+ Index Cond: (four = 1)
+(6 rows)
+
+-- check columns order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ four | ten
+------+-----
+ 0 | 0
+ 0 | 2
+ 0 | 4
+ 0 | 6
+ 0 | 8
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Skip scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Skip scan mode: true
+ Index Cond: (four = 0)
+(3 rows)
+
+DROP INDEX tenk1_four_ten;
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ four | ten
+------+-----
+ 0 | 2
+ 2 | 2
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_ten_four on tenk1
+ Skip scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_ten_four on tenk1
+ Skip scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+DROP INDEX tenk1_ten_four;
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+ four | four
+------+------
+ 0 | 0
+ 2 | 2
+(2 rows)
+
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+ four | ?column?
+------+----------
+ 2 | 1
+ 0 | 1
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+-------------------------------------------
+ Index Only Scan using tenk1_four on tenk1
+ Skip scan mode: true
+(2 rows)
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+FETCH FROM c;
+ four
+------
+ 0
+(1 row)
+
+FETCH BACKWARD FROM c;
+ four
+------
+(0 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+END;
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 9999
+ 1 | 10000
+(5 rows)
+
+DROP TABLE distinct_visibility;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index f96bebf410..a3be42a725 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..52c93c5159 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,181 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+SELECT DISTINCT four FROM tenk1 ORDER BY four DESC;
+
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, hundred, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) hundred
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+DROP TABLE distinct_a;
+
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+DROP TABLE distinct_abc;
+
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+-- check columns order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+DROP INDEX tenk1_four_ten;
+
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+DROP INDEX tenk1_ten_four;
+
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+END;
+
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DROP TABLE distinct_visibility;
--
2.21.0
Sorry for the long delay. Here is more or less what I had in mind. After changing
read_closest to read the whole page, I couldn't resist merging it into
readpage itself, since it's basically the same. This could raise questions about
performance and how intrusive the patch is, but I hope it's not much of a
problem (in the worst case we can split it back out). I've also added a few tests for
the issue you mentioned. Thanks again, I appreciate how much effort you
put into reviewing!
Putting it into one function makes sense, I think. Looking at the patch, there are some good improvements in there in general.
I'm afraid I did manage to find another incorrect query result though, having to do with the keepPrev part and skipping to the first tuple on an index page:
postgres=# drop table if exists b; create table b as select a,b::int2 b,(b%2)::int2 c from generate_series(1,5) a, generate_series(1,366) b; create index on b (a,b,c); analyze b;
DROP TABLE
SELECT 1830
CREATE INDEX
ANALYZE
postgres=# set enable_indexskipscan=1;
SET
postgres=# select distinct on (a) a,b,c from b where b>=1 and c=0 order by a,b;
a | b | c
---+---+---
1 | 2 | 0
2 | 4 | 0
3 | 4 | 0
4 | 4 | 0
5 | 4 | 0
(5 rows)
postgres=# set enable_indexskipscan=0;
SET
postgres=# select distinct on (a) a,b,c from b where b>=1 and c=0 order by a,b;
a | b | c
---+---+---
1 | 2 | 0
2 | 2 | 0
3 | 2 | 0
4 | 2 | 0
5 | 2 | 0
(5 rows)
-Floris
On Wed, Aug 28, 2019 at 9:32 PM Floris Van Nee <florisvannee@optiver.com> wrote:
I'm afraid I did manage to find another incorrect query result though
Yes, it's an example of what I mentioned before: the current modified
implementation of `_bt_readpage` doesn't work well when moving between
pages. So far it seems that the only problem arises when the previous and
next items are located on different pages. I've checked how this issue can be
avoided, and I hope to post a new version relatively soon.
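To make the page-boundary discussion easier to follow, here is a toy model of the skip scan idea: look for the next distinct prefix on the current page first, and otherwise move on to the next page (the real patch re-descends from the root instead of walking page by page). This is an illustration only — the "index" is just a flat sorted list split into fixed-size pages, and all names are made up; it is not the nbtree implementation.

```python
import bisect

def skip_scan(index, page_size, prefix=1):
    """Return the first tuple of each distinct `prefix`-column value."""
    pages = [index[i:i + page_size] for i in range(0, len(index), page_size)]
    out = []
    current = None          # last returned prefix value
    pno = 0
    while pno < len(pages):
        keys = [t[:prefix] for t in pages[pno]]
        # Search the current page first for a key strictly greater than
        # the last returned prefix ("skip" without leaving the page).
        pos = 0 if current is None else bisect.bisect_right(keys, current)
        if pos < len(keys):
            current = keys[pos]
            out.append(pages[pno][pos])
        else:
            # Not on this page: move on (the patch restarts the search
            # from the root of the index here instead).
            pno += 1
    return out

rows = sorted((a, b) for a in (1, 2, 3) for b in range(1, 6))
assert skip_scan(rows, page_size=4) == [(1, 1), (2, 1), (3, 1)]
```

The incorrect results discussed in this thread show up exactly at the `pno += 1` boundary in the real code, where the "previous" item examined by keepPrev and the next candidate sit on different pages.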
Surely it isn't right to add members prefixed with "ioss_" to
struct IndexScanState.
I'm surprised about this "FirstTupleEmitted" business. Wouldn't it make
more sense to implement index_skip() to return the first tuple if the
scan is just starting? (I know little about executor, apologies if this
is a stupid question.)
It would be good to get more knowledgeable people to review this patch.
It's clearly something we want, yet it's been there for a very long
time.
Thanks
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Sep 2, 2019 at 3:28 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Here is a version in which stepping between pages works better. It seems
sufficient to fix the case you mentioned before, but it requires
propagating the keepPrev logic through `_bt_steppage` and `_bt_readnextpage`,
and I can't say I like this solution. I suspect it would be simpler
to teach the code after index_skip not to call `_bt_next` immediately after a
skip has happened. That should eliminate several hacks from the index skip
code itself, so I'll try to pursue this idea.
On Wed, Sep 4, 2019 at 10:45 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Thank you for checking it out!
Surely it isn't right to add members prefixed with "ioss_" to
struct IndexScanState.
Yeah, sorry. I originally added IndexScan support only to show that
it's possible (with some limitations), but then forgot to clean up. Those
fields are now renamed.
I'm surprised about this "FirstTupleEmitted" business. Wouldn't it make
more sense to implement index_skip() to return the first tuple if the
scan is just starting? (I know little about executor, apologies if this
is a stupid question.)
I'm not entirely sure which part exactly you mean. Right now the first tuple is
returned by `_bt_first`; how would it help if index_skip returned it?
It would be good to get more knowledgeable people to review this patch.
It's clearly something we want, yet it's been there for a very long
time.
Sure, that would be nice.
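For readers following the FirstTupleEmitted discussion, a toy sketch of the control flow the flag enables may help: the first tuple comes from the ordinary first-tuple fetch (playing the role of `_bt_first`), and only subsequent tuples come from the skip call. All names here are made up for illustration; this mirrors the shape of the patch, not the actual executor code.

```python
class ToyScan:
    """Stand-in for an index scan; rows is a sorted list of tuples."""
    def __init__(self, rows):
        self.rows = rows
        self.pos = -1

    def first(self):
        # Plays the role of _bt_first: position on the first match.
        self.pos = 0
        return self.rows[0] if self.rows else None

    def skip(self, prefix):
        # Plays the role of amskip: advance past tuples whose first
        # `prefix` columns equal those of the current tuple.
        cur = self.rows[self.pos][:prefix]
        while self.pos + 1 < len(self.rows):
            self.pos += 1
            if self.rows[self.pos][:prefix] > cur:
                return self.rows[self.pos]
        return None

def distinct_scan(scan, prefix):
    """Executor-side loop: first() once, then skip() for each new prefix."""
    first_tuple_emitted = False
    while True:
        tup = scan.skip(prefix) if first_tuple_emitted else scan.first()
        first_tuple_emitted = True
        if tup is None:
            return
        yield tup

rows = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1)]
assert list(distinct_scan(ToyScan(rows), prefix=1)) == [(1, 1), (2, 1), (3, 1)]
```

Having index_skip itself return the first tuple, as suggested upthread, would move this state out of the executor node and into the scan; the sketch keeps them separate only to mirror the current patch.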
Attachments:
v25-0001-Index-skip-scan.patch (application/octet-stream)
From ee597d579b363ce39e9f4406f1c912e9ce3b4803 Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Wed, 3 Jul 2019 16:25:20 +0200
Subject: [PATCH v25] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan and IndexScan. To make it suitable both for
situations with a small number of distinct values and for those with a
significant number, the following approach is taken: instead of
searching from the root for every value, we first search on the current
page, and only if the value is not found there do we continue the
search from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
Author: Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Floris Van Nee
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 15 +
doc/src/sgml/indexam.sgml | 63 +++
doc/src/sgml/indices.sgml | 24 +
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 18 +
src/backend/access/nbtree/nbtree.c | 13 +
src/backend/access/nbtree/nbtsearch.c | 519 +++++++++++++++++-
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 29 +
src/backend/executor/nodeIndexonlyscan.c | 46 +-
src/backend/executor/nodeIndexscan.c | 43 +-
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 3 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/path/pathkeys.c | 84 ++-
src/backend/optimizer/plan/createplan.c | 20 +-
src/backend/optimizer/plan/planagg.c | 1 +
src/backend/optimizer/plan/planner.c | 91 ++-
src/backend/optimizer/util/pathnode.c | 40 ++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 8 +
src/include/access/genam.h | 2 +
src/include/access/nbtree.h | 7 +
src/include/nodes/execnodes.h | 6 +
src/include/nodes/pathnodes.h | 10 +
src/include/nodes/plannodes.h | 2 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/include/optimizer/paths.h | 4 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 511 +++++++++++++++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 196 +++++++
41 files changed, 1752 insertions(+), 37 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index cc1670934f..ab9f0a7177 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 89284dc5c0..3edd12dd27 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4413,6 +4413,21 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). The default is
+ <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index dd54c68802..73b1b4fcf7 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -144,6 +144,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -687,6 +688,68 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan,
+ ScanDirection direction,
+ ScanDirection indexdir,
+ bool scanstart,
+ int prefix);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan. The arguments are:
+
+ <variablelist>
+ <varlistentry>
+ <term><parameter>scan</parameter></term>
+ <listitem>
+ <para>
+ Index scan information
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>direction</parameter></term>
+ <listitem>
+ <para>
+ The direction in which data is advancing.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>indexdir</parameter></term>
+ <listitem>
+ <para>
+ The direction in which the index must be read.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>scanstart</parameter></term>
+ <listitem>
+ <para>
+ Whether or not this is the start of the scan.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>prefix</parameter></term>
+ <listitem>
+ <para>
+ Distinct prefix size.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 95c0a1926c..567141046f 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1235,6 +1235,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When an index scan is used to retrieve the distinct values of a column,
+ it can be inefficient, since it has to scan all the equal values of a
+ key. In such cases the planner will consider applying the index skip
+ scan approach, which is based on the idea of a
+ <firstterm>Loose index scan</firstterm>. Rather than scanning all equal
+ values of a key, as soon as a new value is found it searches for a
+ larger value on the same index page, and if none is found, it restarts
+ the search from the root of the index. This is much faster when the
+ index has many
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ae7b729edd..233ea9e5ec 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -109,6 +109,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cf9699ad18..9817f34c34 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0cc87911d6..38072ad24b 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 5cc30dac42..019e330cff 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -82,6 +82,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 28edd4aca7..ae7a882571 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -730,6 +731,23 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool scanstart, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction,
+ indexdir, scanstart, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..46471598d1 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -131,6 +131,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -380,6 +381,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -447,6 +450,16 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool start, int prefix)
+{
+ return _bt_skip(scan, direction, indexdir, start, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7f77ed24c5..0b7a1d3e56 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -28,16 +28,27 @@ static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
+static bool _bt_readpage_internal(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum, ScanDirection indexdir,
+ bool keepPrev);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_steppage_internal(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool keepPrev);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
+static bool _bt_readnextpage_internal(IndexScanDesc scan, BlockNumber blkno,
+ ScanDirection dir, ScanDirection indexdir,
+ bool keepPrev);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
ScanDirection dir);
static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
+static inline void _bt_update_skip_scankeys(IndexScanDesc scan,
+ Relation indexRel);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir);
/*
* _bt_drop_lock_and_maybe_pin()
@@ -1373,6 +1384,315 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple.
+ *
+ * The current position is set so that a subsequent call to _bt_next will
+ * fetch the first tuple that differs in the leading 'prefix' keys.
+ *
+ * There are four kinds of skipping (depending on dir and indexdir)
+ * that are important to distinguish, especially in the presence of an
+ * index condition:
+ *
+ * * Advancing forward and reading forward
+ * simple scan
+ *
+ * * Advancing forward and reading backward
+ * scan inside a cursor fetching backward, when skipping is necessary
+ * right from the start
+ *
+ * * Advancing backward and reading forward
+ * scan with order by desc inside a cursor fetching forward, when
+ * skipping is necessary right from the start
+ *
+ * * Advancing backward and reading backward
+ * simple scan with order by desc
+ *
+ * This function in conjunction with _bt_readpage_internal handles them all.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool scanstart, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Relation indexRel = scan->indexRelation;
+ OffsetNumber startOffset = ItemPointerGetOffsetNumber(&scan->xs_itup->t_tid);
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ so->skipScanKey->keysz = prefix;
+ so->skipScanKey->scantid = NULL;
+ }
+ else
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos) &&
+ _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {
+ bool keyFound = false;
+
+ LockBuffer(so->currPos.buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, so->currPos.buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(so->currPos.buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Move back for _bt_next */
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ /* Now read the data */
+ keyFound = _bt_readpage_internal(scan, dir, offnum, indexdir, true);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found the scan key within the current page, so let's scan
+ * from the root. Use _bt_search and _bt_binsrch to get the buffer and
+ * offset number.
+ */
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ /*
+ * Simplest case: advance forward and also read forward. At this moment
+ * we are at the next distinct key at the beginning of the series. Go
+ * back one step and let _bt_readpage_internal deal with the index
+ * condition.
+ */
+ if (ScanDirectionIsForward(dir) && ScanDirectionIsForward(indexdir))
+ offnum = OffsetNumberPrev(offnum);
+
+ /*
+ * Advance backward but read forward. At this moment we are at the next
+ * distinct key at the beginning of the series. If the scan has just
+ * started, we can read forward without doing anything else. Otherwise
+ * find the previous distinct key and the beginning of its series and
+ * read forward from there. To do so, go back one step, perform a binary
+ * search to find the first item in the series and let
+ * _bt_readpage_internal do everything else.
+ */
+ else if (ScanDirectionIsBackward(dir) && ScanDirectionIsForward(indexdir))
+ {
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ if (!scanstart)
+ {
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ /* One step back to find a previous value */
+ _bt_readpage_internal(scan, dir, offnum, dir, true);
+
+ if (_bt_next(scan, dir))
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ /*
+ * And now find the last item of the series for the current value,
+ * with the intention of doing OffsetNumberNext. As a result we end
+ * up on the first element of the series.
+ */
+ if (_bt_scankey_within_page(scan, so->skipScanKey,
+ so->currPos.buf, dir))
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ else
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ }
+ }
+ else
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ }
+
+ /*
+ * Advance forward but read backward. At this moment we are at the next
+ * distinct key at the beginning of the series. If the scan has just
+ * started, we can go one step back and read forward without doing
+ * anything else. Otherwise find the next distinct key and the beginning
+ * of its series, go one step back and read backward from there.
+ *
+ * An interesting situation can happen if one of the distinct keys does
+ * not pass a corresponding index condition at all. In this case reading
+ * backward can lead to a previous distinct key being found, creating a
+ * loop. To avoid that, check the value to be returned, and jump one
+ * more time if it's the same as at the beginning.
+ */
+ else if (ScanDirectionIsForward(dir) && ScanDirectionIsBackward(indexdir))
+ {
+ if (scanstart)
+ offnum = OffsetNumberPrev(offnum);
+ else
+ {
+ OffsetNumber nextOffset = startOffset;
+
+ while(nextOffset == startOffset)
+ {
+ /*
+ * Find a next index tuple to update scan key. It could be at
+ * the end, so check for max offset
+ */
+ OffsetNumber curOffnum = offnum;
+ Page page = BufferGetPage(so->currPos.buf);
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ ItemId itemid = PageGetItemId(page, Min(offnum, maxoff));
+
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ scan->xs_itup = (IndexTuple) PageGetItem(page, itemid);
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /*
+ * If the jump to the next key returned the same offset, it means
+ * we are at the end and need to return
+ */
+ if (offnum == curOffnum)
+ {
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ BTScanPosUnpinIfPinned(so->currPos);
+ BTScanPosInvalidate(so->currPos);
+
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+
+ /* Check if _bt_readpage_internal returns already found item */
+ if (_bt_readpage_internal(scan, dir, offnum, indexdir, true))
+ {
+ IndexTuple itup;
+
+ currItem = &so->currPos.items[so->currPos.lastItem];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
+ }
+ else
+ {
+ elog(ERROR, "could not read closest index tuples: %d", offnum);
+ }
+
+ /*
+ * If nextOffset is the same as before, it means we are in a loop;
+ * restore offnum to its original position and jump one key further
+ */
+ if (nextOffset == startOffset)
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage_internal(scan, dir, offnum, indexdir, true))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage_internal(scan, dir, indexdir, true))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
@@ -1394,12 +1714,33 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
*/
static bool
_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
+{
+ return _bt_readpage_internal(scan, dir, offnum,
+ NoMovementScanDirection, false);
+}
+
+/*
+ * _bt_readpage_internal() -- worker function for _bt_readpage
+ *
+ * Besides the regular readpage functionality, this function allows saving
+ * the item just before those that we would normally save in _bt_readpage.
+ * This is used for _bt_skip.
+ *
+ * For that, the caller needs to set keepPrev to true. Since the
+ * definition of "previous" in the case of a cursor also depends on the
+ * index direction, it needs to be provided as an argument as well.
+ */
+static bool
+_bt_readpage_internal(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum, ScanDirection indexdir, bool keepPrev)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Page page;
BTPageOpaque opaque;
OffsetNumber minoff;
OffsetNumber maxoff;
+ IndexTuple prevItup = NULL;
+ OffsetNumber prevOffnum = InvalidOffsetNumber;
int itemIndex;
bool continuescan;
int indnatts;
@@ -1456,7 +1797,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
*/
Assert(BTScanPosIsPinned(so->currPos));
- if (ScanDirectionIsForward(dir))
+ if (ScanDirectionIsForward(keepPrev ? indexdir : dir))
{
/* load items[] in ascending order */
itemIndex = 0;
@@ -1482,14 +1823,42 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
- /* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (keepPrev)
+ {
+ /* tuple passes all scan key conditions, so remember it */
+ if (ScanDirectionIsBackward(dir) || itemIndex >= 2)
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else if (prevItup != NULL)
+ {
+ /*
+ * Save the current item and the previous, even if the
+ * latter does not pass scan key conditions
+ */
+ _bt_saveitem(so, itemIndex, prevOffnum, prevItup);
+ itemIndex++;
+
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ }
+ else
+ {
+ /* tuple passes all scan key conditions, so remember it */
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
break;
+ /* Save previous tuple and offset */
+ prevItup = itup;
+ prevOffnum = offnum;
+
offnum = OffsetNumberNext(offnum);
}
@@ -1517,10 +1886,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
+ /* If there are no items saved, but we have a previous one, save it */
+ if (keepPrev && itemIndex == 0 && prevItup != NULL)
+ {
+ _bt_saveitem(so, itemIndex, offnum, prevItup);
+ itemIndex++;
+ }
+
Assert(itemIndex <= MaxIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
- so->currPos.itemIndex = 0;
+ if (keepPrev)
+ so->currPos.itemIndex = ScanDirectionIsForward(dir) ? 0 : 1;
+ else
+ so->currPos.itemIndex = 0;
}
else
{
@@ -1566,9 +1945,34 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
&continuescan);
if (passes_quals && tuple_alive)
{
- /* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (keepPrev)
+ {
+ /* tuple passes all scan key conditions, so remember it */
+ if (ScanDirectionIsForward(dir) ||
+ MaxIndexTuplesPerPage - itemIndex >= 2)
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else if (prevItup != NULL)
+ {
+ /*
+ * Save the current item and the previous, even if the
+ * latter does not pass scan key conditions
+ */
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, prevOffnum, prevItup);
+
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ }
+ else
+ {
+ /* tuple passes all scan key conditions, so remember it */
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
}
if (!continuescan)
{
@@ -1577,16 +1981,34 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
break;
}
+ /* Save previous tuple and offset */
+ prevItup = itup;
+ prevOffnum = offnum;
+
offnum = OffsetNumberPrev(offnum);
}
+ /* If there are no items saved, but we have a previous one, save it */
+ if (keepPrev && itemIndex == 0 && prevItup != NULL)
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, prevOffnum, prevItup);
+ }
+
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ if (keepPrev)
+ so->currPos.itemIndex = MaxIndexTuplesPerPage -
+ (ScanDirectionIsForward(dir) ? 2 : 1);
+ else
+ so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
}
- return (so->currPos.firstItem <= so->currPos.lastItem);
+ if (keepPrev)
+ return (so->currPos.firstItem < so->currPos.lastItem);
+ else
+ return (so->currPos.firstItem <= so->currPos.lastItem);
}
/* Save an index item into so->currPos.items[itemIndex] */
@@ -1622,6 +2044,14 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
static bool
_bt_steppage(IndexScanDesc scan, ScanDirection dir)
{
+ return _bt_steppage_internal(scan, dir, NoMovementScanDirection, false);
+}
+
+static bool
+_bt_steppage_internal(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool keepPrev)
+{
+
BTScanOpaque so = (BTScanOpaque) scan->opaque;
BlockNumber blkno = InvalidBlockNumber;
bool status = true;
@@ -1707,7 +2137,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
}
}
- if (!_bt_readnextpage(scan, blkno, dir))
+ if (!_bt_readnextpage_internal(scan, blkno, dir, indexdir, keepPrev))
return false;
/* Drop the lock, and maybe the pin, on the current page */
@@ -1729,6 +2159,16 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
static bool
_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
{
+ return _bt_readnextpage_internal(scan, blkno, dir,
+ NoMovementScanDirection, false);
+}
+
+static bool
+_bt_readnextpage_internal(IndexScanDesc scan, BlockNumber blkno,
+ ScanDirection dir, ScanDirection indexdir,
+ bool keepPrev)
+{
+
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel;
Page page;
@@ -1764,7 +2204,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ if (_bt_readpage_internal(scan, dir, P_FIRSTDATAKEY(opaque),
+ indexdir, keepPrev))
break;
}
else if (scan->parallel_scan != NULL)
@@ -1866,7 +2307,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ if (_bt_readpage_internal(scan, dir, PageGetMaxOffsetNumber(page),
+ indexdir, keepPrev))
break;
}
else if (scan->parallel_scan != NULL)
@@ -2244,3 +2686,52 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
so->numKilled = 0; /* just paranoia */
so->markItemIndex = -1; /* ditto */
}
+
+/*
+ * _bt_update_skip_scankeys() -- set up new values for the existing scankeys
+ * based on the current index tuple
+ */
+static inline void
+_bt_update_skip_scankeys(IndexScanDesc scan, Relation indexRel)
+{
+ TupleDesc itupdesc;
+ int indnkeyatts, i;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey scankeys = so->skipScanKey->scankeys;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+}
+
+/*
+ * _bt_scankey_within_page() -- check if the provided scankey could be
+ * found within the page held in the given buffer.
+ */
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir)
+{
+ OffsetNumber low, high, compare_offset;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ return _bt_compare(scan->indexRelation,
+ key, page, compare_offset) > compare_value;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 45472db147..dc151ecf09 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 62fb3434a3..ad500de12b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -130,6 +130,7 @@ static void ExplainDummyGroup(const char *objtype, const char *labelname,
static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
static void ExplainJSONLineEnding(ExplainState *es);
static void ExplainYAMLLineStarting(ExplainState *es);
+static void ExplainIndexSkipScanKeys(ExplainState *es, int skipPrefixSize);
static void escape_yaml(StringInfo buf, const char *str);
@@ -1041,6 +1042,22 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
return planstate_tree_walker(planstate, ExplainPreScanNode, rels_used);
}
+/*
+ * ExplainIndexSkipScanKeys -
+ * Append information about index skip scan to es->str.
+ *
+ * Can be used to print the skip prefix size.
+ */
+static void
+ExplainIndexSkipScanKeys(ExplainState *es, int skipPrefixSize)
+{
+ if (skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL, skipPrefixSize, es);
+ }
+}
+
/*
* ExplainNode -
* Appends a description of a plan tree to es->str
@@ -1363,6 +1380,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexScan *indexscan = (IndexScan *) plan;
+ ExplainIndexSkipScanKeys(es, indexscan->indexskipprefixsize);
+
ExplainIndexScanDetails(indexscan->indexid,
indexscan->indexorderdir,
es);
@@ -1373,6 +1392,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ ExplainIndexSkipScanKeys(es, indexonlyscan->indexskipprefixsize);
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1582,6 +1603,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
switch (nodeTag(plan))
{
case T_IndexScan:
+ if (((IndexScan *) plan)->indexskipprefixsize > 0)
+ {
+ ExplainPropertyBool("Skip scan mode", true, es);
+ }
show_scan_qual(((IndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
if (((IndexScan *) plan)->indexqualorig)
@@ -1595,6 +1620,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->indexskipprefixsize > 0)
+ {
+ ExplainPropertyBool("Skip scan mode", true, es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 652a9afc75..21f169f5ea 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -65,6 +65,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
IndexScanDesc scandesc;
TupleTableSlot *slot;
ItemPointer tid;
+ IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) node->ss.ps.plan;
/*
* extract necessary information from index scan node
@@ -72,7 +73,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexonlyscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -115,6 +116,45 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->ioss_SkipPrefixSize > 0)
+ {
+ bool startscan = false;
+
+ /*
+ * When fetching a cursor in the direction opposite to a general scan
+ * direction, the result must be what normal fetching should have
+ * returned, but in reversed order. In other words, return the last or
+ * first scanned tuple in a DISTINCT set, depending on a cursor
+ * direction. Skip to that tuple before returning the first tuple.
+ */
+ if (direction * indexonlyscan->indexorderdir < 0 &&
+ !node->ioss_FirstTupleEmitted)
+ {
+ if (index_getnext_tid(scandesc, direction))
+ {
+ node->ioss_FirstTupleEmitted = true;
+ startscan = true;
+ }
+ }
+
+ if (node->ioss_FirstTupleEmitted &&
+ !index_skip(scandesc, direction, indexonlyscan->indexorderdir,
+ startscan, node->ioss_SkipPrefixSize))
+ {
+ /*
+ * Reached the end of the index. At this point currPos is
+ * invalidated, and we need to reset ioss_FirstTupleEmitted,
+ * since otherwise after going backwards, reaching the end of
+ * the index, and going forward again we would apply the skip
+ * again. That would be incorrect and lead to an extra skipped
+ * item.
+ */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
@@ -250,6 +290,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -500,6 +542,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index ac7aa81f67..0a09f8ed92 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -85,6 +85,7 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ IndexScan *indexscan = (IndexScan *) node->ss.ps.plan;
/*
* extract necessary information from index scan node
@@ -92,7 +93,7 @@ IndexNext(IndexScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -116,6 +117,7 @@ IndexNext(IndexScanState *node)
node->iss_NumOrderByKeys);
node->iss_ScanDesc = scandesc;
+ node->iss_ScanDesc->xs_want_itup = true;
/*
* If no run-time keys to calculate or they are ready, go ahead and
@@ -127,6 +129,42 @@ IndexNext(IndexScanState *node)
node->iss_OrderByKeys, node->iss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ */
+ if (node->iss_SkipPrefixSize > 0)
+ {
+ bool startscan = false;
+
+ /*
+ * If advancing direction is different from index direction, we must
+ * skip right away, but _bt_skip requires a starting point.
+ */
+ if (direction * indexscan->indexorderdir < 0 &&
+ !node->iss_FirstTupleEmitted)
+ {
+ if (index_getnext_slot(scandesc, direction, slot))
+ {
+ node->iss_FirstTupleEmitted = true;
+ startscan = true;
+ }
+ }
+
+ if (node->iss_FirstTupleEmitted &&
+ !index_skip(scandesc, direction, indexscan->indexorderdir,
+ startscan, node->iss_SkipPrefixSize))
+ {
+ /*
+ * Reached the end of the index. At this point currPos is
+ * invalidated, and we need to reset iss_FirstTupleEmitted,
+ * since otherwise after going backwards, reaching the end of
+ * the index, and going forward again we would apply the skip
+ * again. That would be incorrect and lead to an extra skipped
+ * item.
+ */
+ node->iss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
/*
* ok, now that we have what we need, fetch the next tuple.
*/
@@ -149,6 +187,7 @@ IndexNext(IndexScanState *node)
}
}
+ node->iss_FirstTupleEmitted = true;
return slot;
}
@@ -906,6 +945,8 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexScan;
+ indexstate->iss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->iss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index a2617c7cfd..20495c9e52 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -490,6 +490,7 @@ _copyIndexScan(const IndexScan *from)
COPY_NODE_FIELD(indexorderbyorig);
COPY_NODE_FIELD(indexorderbyops);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
@@ -515,6 +516,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e6ce8e2110..2ff9625533 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -559,6 +559,7 @@ _outIndexScan(StringInfo str, const IndexScan *node)
WRITE_NODE_FIELD(indexorderbyorig);
WRITE_NODE_FIELD(indexorderbyops);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
@@ -573,6 +574,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
@@ -2213,6 +2215,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
WRITE_NODE_FIELD(group_pathkeys);
WRITE_NODE_FIELD(window_pathkeys);
WRITE_NODE_FIELD(distinct_pathkeys);
+ WRITE_NODE_FIELD(uniq_distinct_pathkeys);
WRITE_NODE_FIELD(sort_pathkeys);
WRITE_NODE_FIELD(processed_tlist);
WRITE_NODE_FIELD(minmax_aggs);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 764e3bb90c..0fc3c5ea68 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1787,6 +1787,7 @@ _readIndexScan(void)
READ_NODE_FIELD(indexorderbyorig);
READ_NODE_FIELD(indexorderbyops);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
@@ -1806,6 +1807,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c5f6593485..194e258dc1 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 2f4fea241a..70c1df47a4 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
#include "utils/lsyscache.h"
+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
RelOptInfo *partrel,
@@ -96,6 +97,30 @@ make_canonical_pathkey(PlannerInfo *root,
return pk;
}
+/*
+ * pathkey_is_unique
+ * Checks whether the new pathkey's equivalence class matches that of
+ * any existing member of the pathkey list. Split out of
+ * pathkey_is_redundant so that it can also be used on its own.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+ EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+ ListCell *lc;
+
+ /* If the same EC is already used in the list, then it's a duplicate */
+ foreach(lc, pathkeys)
+ {
+ PathKey *old_pathkey = (PathKey *) lfirst(lc);
+
+ if (new_ec == old_pathkey->pk_eclass)
+ return true;
+ }
+
+ return false;
+}
+
/*
* pathkey_is_redundant
* Is a pathkey redundant with one already in the given list?
@@ -135,22 +160,12 @@ static bool
pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys)
{
EquivalenceClass *new_ec = new_pathkey->pk_eclass;
- ListCell *lc;
/* Check for EC containing a constant --- unconditionally redundant */
if (EC_MUST_BE_REDUNDANT(new_ec))
return true;
- /* If same EC already used in list, then redundant */
- foreach(lc, pathkeys)
- {
- PathKey *old_pathkey = (PathKey *) lfirst(lc);
-
- if (new_ec == old_pathkey->pk_eclass)
- return true;
- }
-
- return false;
+ return pathkey_is_unique(new_pathkey, pathkeys);
}
/*
@@ -1098,6 +1113,53 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
return pathkeys;
}
+/*
+ * make_pathkeys_for_distinctclauses
+ * Generate a pathkeys list for distinct clauses that represents the sort
+ * order specified by a list of SortGroupClauses. Similar to
+ * make_pathkeys_for_sortclauses, but allows the caller to choose between
+ * checking full redundancy and checking only uniqueness.
+ */
+List *
+make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *distinctclauses,
+ List *tlist, bool checkRedundant)
+{
+ List *pathkeys = NIL;
+ ListCell *l;
+
+ foreach(l, distinctclauses)
+ {
+ SortGroupClause *sortcl = (SortGroupClause *) lfirst(l);
+ Expr *sortkey;
+ PathKey *pathkey;
+
+ sortkey = (Expr *) get_sortgroupclause_expr(sortcl, tlist);
+ Assert(OidIsValid(sortcl->sortop));
+ pathkey = make_pathkey_from_sortop(root,
+ sortkey,
+ root->nullable_baserels,
+ sortcl->sortop,
+ sortcl->nulls_first,
+ sortcl->tleSortGroupRef,
+ true);
+
+ /* Canonical form eliminates redundant ordering keys */
+ if (checkRedundant)
+ {
+ if (!pathkey_is_redundant(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ else
+ {
+ if (!pathkey_is_unique(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ }
+ return pathkeys;
+}
+
+
/****************************************************************************
* PATHKEYS AND MERGECLAUSES
****************************************************************************/
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 0c036209f0..6e54446b29 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -175,12 +175,14 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
Oid indexid, List *indexqual, List *indexqualorig,
List *indexorderby, List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2905,7 +2907,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -2916,7 +2919,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexorderbys,
indexorderbys,
indexorderbyops,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
copy_generic_path_info(&scan_plan->plan, &best_path->path);
@@ -5179,7 +5183,8 @@ make_indexscan(List *qptlist,
List *indexorderby,
List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexScan *node = makeNode(IndexScan);
Plan *plan = &node->scan.plan;
@@ -5196,6 +5201,7 @@ make_indexscan(List *qptlist,
node->indexorderbyorig = indexorderbyorig;
node->indexorderbyops = indexorderbyops;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
@@ -5208,7 +5214,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5223,6 +5230,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 9381939c82..ed52139839 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -505,6 +505,7 @@ minmax_qp_callback(PlannerInfo *root, void *extra)
root->group_pathkeys = NIL;
root->window_pathkeys = NIL;
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 17c5f086fb..688dcca4f1 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3622,12 +3622,21 @@ standard_qp_callback(PlannerInfo *root, void *extra)
if (parse->distinctClause &&
grouping_is_sortable(parse->distinctClause))
+ {
+ root->uniq_distinct_pathkeys =
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, false);
root->distinct_pathkeys =
- make_pathkeys_for_sortclauses(root,
- parse->distinctClause,
- tlist);
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, true);
+ }
else
+ {
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
+ }
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
@@ -4814,6 +4823,82 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Consider index skip scan as well */
+ if (enable_indexskipscan &&
+ IsA(path, IndexPath) &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys != NIL)
+ {
+ ListCell *lc;
+ IndexOptInfo *index = NULL;
+ bool different_columns_order = false,
+ not_empty_qual = false;
+ int i = 0;
+ int distinctPrefixKeys;
+
+ Assert(path->pathtype == T_IndexOnlyScan ||
+ path->pathtype == T_IndexScan);
+
+ index = ((IndexPath *) path)->indexinfo;
+ distinctPrefixKeys = list_length(root->uniq_distinct_pathkeys);
+
+ /*
+ * Normally we can think of distinctPrefixKeys as just the
+ * number of distinct keys. But if, let's say, we have a
+ * distinct key a, and the index contains (b, a) in exactly
+ * this order, then we need to use the position of a in the
+ * index as distinctPrefixKeys, otherwise the skip would
+ * happen only over the first column.
+ */
+ foreach(lc, root->uniq_distinct_pathkeys)
+ {
+ PathKey *pathKey = lfirst_node(PathKey, lc);
+ EquivalenceMember *em =
+ lfirst_node(EquivalenceMember,
+ list_head(pathKey->pk_eclass->ec_members));
+ Var *var = (Var *) em->em_expr;
+
+ Assert(i < index->ncolumns);
+
+ for (i = 0; i < index->ncolumns; i++)
+ {
+ if (index->indexkeys[i] == var->varattno)
+ {
+ distinctPrefixKeys = Max(i + 1, distinctPrefixKeys);
+ break;
+ }
+ }
+ }
+
+ /*
+ * XXX: In the case of an index scan, qual evaluation happens
+ * after ExecScanFetch, which means skip results could be
+ * filtered out. Consider the following query:
+ *
+ * select distinct on (a, b) a, b, c from t where c < 100;
+ *
+ * A skip scan returns one tuple per distinct set of (a, b)
+ * with an arbitrary one of the c values, so if the chosen c
+ * does not match the qual while some other c does, we miss
+ * that tuple.
+ */
+ if (path->pathtype == T_IndexScan &&
+ parse->jointree != NULL &&
+ parse->jointree->quals != NULL &&
+ list_length((List*) parse->jointree->quals) != 0)
+ not_empty_qual = true;
+
+ if (!different_columns_order && !not_empty_qual)
+ {
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ distinctPrefixKeys,
+ numDistinctRows));
+ }
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 34acb732ee..1de6ae8dcc 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2904,6 +2904,46 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *basepath,
+ int distinctPrefixKeys,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(distinctPrefixKeys > 0);
+ /*Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));*/
+ pathnode->indexskipprefix = distinctPrefixKeys;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = basepath->startup_cost;
+ pathnode->path.total_cost = basepath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index cf1761401d..34fbc27716 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -272,6 +272,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 90ffd89339..9e5b74b6de 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -910,6 +910,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0fc23e3a61..88f9890780 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6e3db06eed..f84791e358 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -130,6 +130,13 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir,
+ ScanDirection indexdir,
+ bool start,
+ int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -225,6 +232,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 8c053be2ca..e5ec5b07c8 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -173,6 +173,8 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool start, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 52eafe6b00..6d441a4696 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -662,6 +662,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTScanInsert skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -776,6 +779,8 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool start, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -800,6 +805,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool start, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f42189d2bf..4d2e994695 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1377,6 +1377,8 @@ typedef struct IndexScanState
ExprContext *iss_RuntimeContext;
Relation iss_RelationDesc;
struct IndexScanDescData *iss_ScanDesc;
+ int iss_SkipPrefixSize;
+ bool iss_FirstTupleEmitted;
/* These are needed for re-checking ORDER BY expr ordering */
pairingheap *iss_ReorderQueue;
@@ -1406,6 +1408,8 @@ typedef struct IndexScanState
* TableSlot slot for holding tuples fetched from the table
* VMBuffer buffer in use for visibility map testing, if any
* PscanLen size of parallel index-only scan descriptor
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ----------------
*/
typedef struct IndexOnlyScanState
@@ -1424,6 +1428,8 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
TupleTableSlot *ioss_TableSlot;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 23a06d718e..ff11c17cca 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -298,6 +298,11 @@ struct PlannerInfo
List *group_pathkeys; /* groupClause pathkeys, if any */
List *window_pathkeys; /* pathkeys of bottom window, if any */
List *distinct_pathkeys; /* distinctClause pathkeys, if any */
+ List *uniq_distinct_pathkeys; /* unique, but potentially redundant
+ distinctClause pathkeys, if any.
+ Used for index skip scan, since
+ redundant distinctClauses also must
+ be considered */
List *sort_pathkeys; /* sortClause pathkeys, if any */
List *part_schemes; /* Canonicalised partition schemes used in the
@@ -833,6 +838,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1169,6 +1175,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1181,6 +1190,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 8e6594e355..04e871ae83 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -405,6 +405,7 @@ typedef struct IndexScan
List *indexorderbyorig; /* the same in original form */
List *indexorderbyops; /* OIDs of sort ops for ORDER BY exprs */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct scans */
} IndexScan;
/* ----------------
@@ -432,6 +433,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b3d0b4f6fb..9abfdfb6bd 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index a12af54971..7edcf4e689 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -200,6 +200,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137d1d..a782d12a50 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -209,6 +209,10 @@ extern List *build_join_pathkeys(PlannerInfo *root,
extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
List *sortclauses,
List *tlist);
+extern List *make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist,
+ bool checkRedundant);
extern void initialize_mergeclause_eclasses(PlannerInfo *root,
RestrictInfo *restrictinfo);
extern void update_mergeclause_eclasses(PlannerInfo *root,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index c6d575a2f9..4f5c82f49d 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..f0e92a99dd 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,514 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+SELECT DISTINCT four FROM tenk1 ORDER BY four DESC;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, hundred, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) hundred
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+ a | b
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+FETCH FROM c;
+ a | b
+---+---
+ 1 | 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+END;
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+FETCH FROM c;
+ a | b
+---+-------
+ 5 | 10000
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+END;
+DROP TABLE distinct_a;
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+ QUERY PLAN
+--------------------------------------------------------------
+ Index Only Scan using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan mode: true
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 1 | 2
+ 3 | 1 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 1 | 2
+ 1 | 1 | 2
+(2 rows)
+
+END;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Index Only Scan Backward using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan mode: true
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 2 | 2
+ 1 | 2 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 2 | 2
+ 3 | 2 | 2
+(2 rows)
+
+END;
+DROP TABLE distinct_abc;
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ four | ten
+------+-----
+ 0 | 0
+ 1 | 9
+ 2 | 0
+ 3 | 1
+(4 rows)
+
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ four | ten
+------+-----
+ 1 | 9
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ QUERY PLAN
+--------------------------------------
+ Index Scan using tenk1_four on tenk1
+ Skip scan mode: true
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ QUERY PLAN
+---------------------------------------------------
+ Result
+ -> Unique
+ -> Bitmap Heap Scan on tenk1
+ Recheck Cond: (four = 1)
+ -> Bitmap Index Scan on tenk1_four
+ Index Cond: (four = 1)
+(6 rows)
+
+-- check columns order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ four | ten
+------+-----
+ 0 | 0
+ 0 | 2
+ 0 | 4
+ 0 | 6
+ 0 | 8
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Skip scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Skip scan mode: true
+ Index Cond: (four = 0)
+(3 rows)
+
+DROP INDEX tenk1_four_ten;
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ four | ten
+------+-----
+ 0 | 2
+ 2 | 2
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_ten_four on tenk1
+ Skip scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_ten_four on tenk1
+ Skip scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+DROP INDEX tenk1_ten_four;
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+ four | four
+------+------
+ 0 | 0
+ 2 | 2
+(2 rows)
+
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+ four | ?column?
+------+----------
+ 2 | 1
+ 0 | 1
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+-------------------------------------------
+ Index Only Scan using tenk1_four on tenk1
+ Skip scan mode: true
+(2 rows)
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+FETCH FROM c;
+ four
+------
+ 0
+(1 row)
+
+FETCH BACKWARD FROM c;
+ four
+------
+(0 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+END;
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 9999
+ 1 | 10000
+(5 rows)
+
+DROP TABLE distinct_visibility;
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+ SELECT a, b::int2 b, (b % 2)::int2 c FROM
+ generate_series(1, 5) a,
+ generate_series(1,366) b;
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Index Only Scan using distinct_boundaries_a_b_c_idx on distinct_boundaries
+ Skip scan mode: true
+ Index Cond: ((b >= 1) AND (c = 0))
+(3 rows)
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ a | b | c
+---+---+---
+ 1 | 2 | 0
+ 2 | 2 | 0
+ 3 | 2 | 0
+ 4 | 2 | 0
+ 5 | 2 | 0
+(5 rows)
+
+DROP TABLE distinct_boundaries;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index f96bebf410..a3be42a725 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..fddd0256ff 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,199 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+SELECT DISTINCT four FROM tenk1 ORDER BY four DESC;
+
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, hundred, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) hundred
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+DROP TABLE distinct_a;
+
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+DROP TABLE distinct_abc;
+
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+-- check colums order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+DROP INDEX tenk1_four_ten;
+
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+DROP INDEX tenk1_ten_four;
+
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+END;
+
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DROP TABLE distinct_visibility;
+
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+ SELECT a, b::int2 b, (b % 2)::int2 c FROM
+ generate_series(1, 5) a,
+ generate_series(1,366) b;
+
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+DROP TABLE distinct_boundaries;
--
2.21.0
On 2019-Sep-05, Dmitry Dolgov wrote:
Here is the version in which stepping between the pages works better. It seems
sufficient to fix the case you've mentioned before, but for that we need to
propagate keepPrev logic through `_bt_steppage` & `_bt_readnextpage`, and I
can't say I like this solution. I have an idea that maybe it would be simpler
to teach the code after index_skip to not do `_bt_next` right after one skip
happened before. It should immediately eliminate several hacks from index skip
itself, so I'll try to pursue this idea.
Cool.
I think multiplying two ScanDirections to watch for a negative result is
pretty ugly:
/*
* If advancing direction is different from index direction, we must
* skip right away, but _bt_skip requires a starting point.
*/
if (direction * indexscan->indexorderdir < 0 &&
!node->iss_FirstTupleEmitted)
Surely there's a better way to code that?
I think "scanstart" needs more documentation, both in the SGML docs as
well as the code comments surrounding it.
Please disregard my earlier comment about FirstTupleEmitted. I was
thinking that index_skip would itself emit a tuple (ie. call some
"getnext" internally) rather than just repositioning. There might still
be some more convenient way to represent this, but I have no immediate
advice.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Sep 5, 2019 at 9:41 PM Alvaro Herrera from 2ndQuadrant <alvherre@alvh.no-ip.org> wrote:
On 2019-Sep-05, Dmitry Dolgov wrote:
Here is the version in which stepping between the pages works better. It seems
sufficient to fix the case you've mentioned before, but for that we need to
propagate keepPrev logic through `_bt_steppage` & `_bt_readnextpage`, and I
can't say I like this solution. I have an idea that maybe it would be simpler
to teach the code after index_skip to not do `_bt_next` right after one skip
happened before. It should immediately eliminate several hacks from index skip
itself, so I'll try to pursue this idea.

Cool.
Here it is. Since the code after index_skip now knows whether to do
index_getnext or not, it's possible to use unmodified `_bt_readpage` /
`_bt_steppage`. To achieve that, there is a flag that indicates whether or not
we were skipping to the current item (I guess it's possible to implement it
without such a flag, but the end result looked uglier to me). Along the way
I've simplified a few things, and all the tests we accumulated before are
still passing. I'm almost sure it's possible to implement some parts of the
code more elegantly, but I don't see how yet.
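The flag-based approach described above can be condensed into a small sketch. All names here (`SkipState`, `skipped_to_current`, `need_bt_next`) are illustrative, not the patch's actual identifiers; the point is only the control flow: when index_skip has already repositioned the scan onto the first tuple of the next distinct prefix, the caller must consume that tuple directly instead of immediately advancing with `_bt_next`, which would step past it.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical condensed sketch of the control flow described above.
 * After a skip, the scan is already positioned on a valid tuple, so
 * the next fetch must not advance; every fetch after that advances
 * normally.
 */
typedef struct SkipState
{
	bool		skipped_to_current; /* did the last call reposition the scan? */
} SkipState;

static bool
need_bt_next(SkipState *state)
{
	if (state->skipped_to_current)
	{
		/* index_skip left us on a valid tuple: emit it without advancing */
		state->skipped_to_current = false;
		return false;
	}

	/* normal case: advance to the next tuple */
	return true;
}
```

With this shape, `_bt_readpage` and `_bt_steppage` need no modification: the skip-awareness lives entirely in the caller's one-bit state.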
I think multiplying two ScanDirections to watch for a negative result is
pretty ugly:
Probably, but the only alternative I see for checking whether the directions
are opposite is to test that they come in pairs: (back, forth) or (forth,
back). Is there an easier way?
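One way to keep the pair check compact without the multiplication trick is a small named predicate. The helper name below is hypothetical (no such macro exists in PostgreSQL), and the enum mirrors `ScanDirection` from `src/include/access/sdir.h` so the sketch is self-contained:

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors PostgreSQL's ScanDirection (src/include/access/sdir.h). */
typedef enum ScanDirection
{
	BackwardScanDirection = -1,
	NoMovementScanDirection = 0,
	ForwardScanDirection = 1
} ScanDirection;

/*
 * Hypothetical helper: spell out "the two directions are opposite"
 * instead of testing for a negative product.  NoMovementScanDirection
 * is explicitly not opposite to anything.
 */
static inline bool
ScanDirectionsAreOpposite(ScanDirection a, ScanDirection b)
{
	return (a == ForwardScanDirection && b == BackwardScanDirection) ||
		(a == BackwardScanDirection && b == ForwardScanDirection);
}
```

The condition in the executor would then read `ScanDirectionsAreOpposite(direction, indexscan->indexorderdir)`, which states the intent directly.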
I think "scanstart" needs more documentation, both in the SGML docs as
well as the code comments surrounding it.
I was able to remove it after another round of simplification.
Attachments:
v26-0001-Index-skip-scan.patch (application/octet-stream)
From ce5addc1b99dda221f7c79f975ca9b7c9701f188 Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Wed, 3 Jul 2019 16:25:20 +0200
Subject: [PATCH v26] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan and IndexScan. To make it suitable both for
situations with a small number of distinct values and for those with a
significant number of distinct values, the following approach is taken:
instead of searching from the root for every value, we search first on
the current page, and only if the value is not found there do we
continue searching from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
Author: Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Floris Van Nee
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 15 +
doc/src/sgml/indexam.sgml | 63 +++
doc/src/sgml/indices.sgml | 24 +
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 18 +
src/backend/access/nbtree/nbtree.c | 13 +
src/backend/access/nbtree/nbtsearch.c | 354 ++++++++++++
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 29 +
src/backend/executor/nodeIndexonlyscan.c | 46 +-
src/backend/executor/nodeIndexscan.c | 46 +-
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 3 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/path/pathkeys.c | 84 ++-
src/backend/optimizer/plan/createplan.c | 20 +-
src/backend/optimizer/plan/planagg.c | 1 +
src/backend/optimizer/plan/planner.c | 91 +++-
src/backend/optimizer/util/pathnode.c | 40 ++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 8 +
src/include/access/genam.h | 2 +
src/include/access/nbtree.h | 7 +
src/include/nodes/execnodes.h | 6 +
src/include/nodes/pathnodes.h | 10 +
src/include/nodes/plannodes.h | 2 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/include/optimizer/paths.h | 4 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 511 ++++++++++++++++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 196 +++++++
41 files changed, 1602 insertions(+), 25 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index cc1670934f..ab9f0a7177 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 89284dc5c0..3edd12dd27 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4413,6 +4413,21 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). The default is
+ <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index dd54c68802..73b1b4fcf7 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -144,6 +144,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -687,6 +688,68 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan,
+ ScanDirection direction,
+ ScanDirection indexdir,
+ bool scanstart,
+ int prefix);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan. The arguments are:
+
+ <variablelist>
+ <varlistentry>
+ <term><parameter>scan</parameter></term>
+ <listitem>
+ <para>
+ Index scan information
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>direction</parameter></term>
+ <listitem>
+ <para>
+ The direction in which data is advancing.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>indexdir</parameter></term>
+ <listitem>
+ <para>
+ The index direction in which data must be read.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>scanstart</parameter></term>
+ <listitem>
+ <para>
+ Whether or not this is the start of the scan.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>prefix</parameter></term>
+ <listitem>
+ <para>
+ Distinct prefix size.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 95c0a1926c..567141046f 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1235,6 +1235,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When an index scan is used to retrieve the distinct values of a column,
+ it can still be inefficient, since it has to step over all the equal
+ values of a key. In such cases the planner will consider applying the
+ index skip scan approach, which is based on the idea of a
+ <firstterm>Loose index scan</firstterm>. Rather than scanning all equal
+ values of a key, as soon as a new value is found, it searches for a
+ larger value on the same index page, and if none is found there,
+ restarts the search from the root. This is much faster when the index
+ has many equal keys.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ae7b729edd..233ea9e5ec 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -109,6 +109,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cf9699ad18..9817f34c34 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0cc87911d6..38072ad24b 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 5cc30dac42..019e330cff 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -82,6 +82,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 28edd4aca7..ae7a882571 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -730,6 +731,23 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool scanstart, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction,
+ indexdir, scanstart, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..46471598d1 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -131,6 +131,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -380,6 +381,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -447,6 +450,16 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool start, int prefix)
+{
+ return _bt_skip(scan, direction, indexdir, start, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7f77ed24c5..b4e6b7555b 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -37,6 +37,10 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+static inline void _bt_update_skip_scankeys(IndexScanDesc scan,
+ Relation indexRel);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir);
/*
@@ -1373,6 +1377,307 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple.
+ *
+ * The current position is set so that a subsequent call to _bt_next will
+ * fetch the first tuple that differs in the leading 'prefix' keys.
+ *
+ * There are four different kinds of skipping (depending on dir and
+ * indexdir) that are important to distinguish, especially in the presence
+ * of an index condition:
+ *
+ * * Advancing forward and reading forward
+ * simple scan
+ *
+ * * Advancing forward and reading backward
+ * scan inside a cursor fetching backward, when skipping is necessary
+ * right from the start
+ *
+ * * Advancing backward and reading forward
+ * scan with order by desc inside a cursor fetching forward, when
+ * skipping is necessary right from the start
+ *
+ * * Advancing backward and reading backward
+ * simple scan with order by desc
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool scanstart, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Relation indexRel = scan->indexRelation;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ so->skipScanKey->keysz = prefix;
+ so->skipScanKey->scantid = NULL;
+ }
+ else
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos) &&
+ _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {
+ bool keyFound = false;
+
+ LockBuffer(so->currPos.buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, so->currPos.buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(so->currPos.buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found scan key within the current page, so let's scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+ * number
+ */
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ /*
+ * Simplest case is when both directions are forward, when we are already
+ * at the next distinct key at the beginning of the series (so everything
+ * else would be done in _bt_readpage)
+ *
+ * The case when both directions are backwards is also simple, but we need
+ * to go one step back, since we need a last element from the previous
+ * series.
+ */
+ if (ScanDirectionIsBackward(dir) && ScanDirectionIsBackward(indexdir))
+ offnum = OffsetNumberPrev(offnum);
+
+ /*
+ * Advance backward but read forward. At this moment we are at the next
+ * distinct key at the beginning of the series. If the scan has just
+ * started, we can read forward without doing anything else. Otherwise find
+ * the previous distinct key and the beginning of its series and read
+ * forward from there. To do so, go back one step, perform a binary search
+ * to find the first item in the series, and let _bt_readpage do everything
+ * else.
+ */
+ else if (ScanDirectionIsBackward(dir) && ScanDirectionIsForward(indexdir))
+ {
+ if (!scanstart)
+ {
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* One step back to find a previous value */
+ _bt_readpage(scan, dir, offnum);
+ if (_bt_next(scan, dir))
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ /*
+ * And now find the last item from the sequence for the current
+ * value, with the intention of doing OffsetNumberNext. As a result
+ * we end up on the first element of the sequence.
+ */
+ if (_bt_scankey_within_page(scan, so->skipScanKey,
+ so->currPos.buf, dir))
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ else
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ }
+ else
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ }
+
+ /*
+ * Advance forward but read backward. At this moment we are at the next
+ * distinct key at the beginning of the series. If the scan has just
+ * started, we can go one step back and read forward without doing anything
+ * else. Otherwise find the next distinct key and the beginning of its
+ * series, go one step back and read backward from there.
+ *
+ * An interesting situation can happen if one of the distinct keys does not
+ * pass a corresponding index condition at all. In this case reading
+ * backward can lead to a previous distinct key being found, creating a
+ * loop. To avoid that, check the value to be returned, and jump one more
+ * time if it's the same as at the beginning.
+ */
+ else if (ScanDirectionIsForward(dir) && ScanDirectionIsBackward(indexdir))
+ {
+ if (scanstart)
+ offnum = OffsetNumberPrev(offnum);
+ else
+ {
+ OffsetNumber nextOffset, startOffset;
+ nextOffset = startOffset = ItemPointerGetOffsetNumber(&scan->xs_itup->t_tid);
+
+ while(nextOffset == startOffset)
+ {
+ /*
+ * Find a next index tuple to update scan key. It could be at
+ * the end, so check for max offset
+ */
+ OffsetNumber curOffnum = offnum;
+ Page page = BufferGetPage(so->currPos.buf);
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ ItemId itemid = PageGetItemId(page, Min(offnum, maxoff));
+
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ scan->xs_itup = (IndexTuple) PageGetItem(page, itemid);
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /*
+ * The jump to the next key returned the same offset, which means
+ * we are at the end and need to return
+ */
+ if (offnum == curOffnum)
+ {
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ BTScanPosUnpinIfPinned(so->currPos);
+ BTScanPosInvalidate(so->currPos);
+
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+
+ /* Check if _bt_readpage returns already found item */
+ if (_bt_readpage(scan, indexdir, offnum))
+ {
+ IndexTuple itup;
+
+ currItem = &so->currPos.items[so->currPos.lastItem];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
+ }
+ else
+ {
+ elog(ERROR, "could not read closest index tuple: %d", offnum);
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+
+ /*
+ * If nextOffset is the same as before, it means we are in a
+ * loop; return offnum to its original position and jump
+ * further.
+ */
+ if (nextOffset == startOffset)
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, indexdir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
@@ -2244,3 +2549,52 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
so->numKilled = 0; /* just paranoia */
so->markItemIndex = -1; /* ditto */
}
+
+/*
+ * _bt_update_skip_scankeys() -- set up new values for the existing scankeys
+ * based on the current index tuple
+ */
+static inline void
+_bt_update_skip_scankeys(IndexScanDesc scan, Relation indexRel)
+{
+ TupleDesc itupdesc;
+ int indnkeyatts, i;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey scankeys = so->skipScanKey->scankeys;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+}
+
+/*
+ * _bt_scankey_within_page() -- check if the provided scankey could be found
+ * within the page specified by the buffer.
+ */
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir)
+{
+ OffsetNumber low, high, compare_offset;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ return _bt_compare(scan->indexRelation,
+ key, page, compare_offset) > compare_value;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 45472db147..dc151ecf09 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 62fb3434a3..ad500de12b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -130,6 +130,7 @@ static void ExplainDummyGroup(const char *objtype, const char *labelname,
static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
static void ExplainJSONLineEnding(ExplainState *es);
static void ExplainYAMLLineStarting(ExplainState *es);
+static void ExplainIndexSkipScanKeys(ExplainState *es, int skipPrefixSize);
static void escape_yaml(StringInfo buf, const char *str);
@@ -1041,6 +1042,22 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
return planstate_tree_walker(planstate, ExplainPreScanNode, rels_used);
}
+/*
+ * ExplainIndexSkipScanKeys -
+ * Append information about index skip scan to es->str.
+ *
+ * Can be used to print the skip prefix size.
+ */
+static void
+ExplainIndexSkipScanKeys(ExplainState *es, int skipPrefixSize)
+{
+ if (skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL, skipPrefixSize, es);
+ }
+}
+
/*
* ExplainNode -
* Appends a description of a plan tree to es->str
@@ -1363,6 +1380,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexScan *indexscan = (IndexScan *) plan;
+ ExplainIndexSkipScanKeys(es, indexscan->indexskipprefixsize);
+
ExplainIndexScanDetails(indexscan->indexid,
indexscan->indexorderdir,
es);
@@ -1373,6 +1392,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ ExplainIndexSkipScanKeys(es, indexonlyscan->indexskipprefixsize);
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1582,6 +1603,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
switch (nodeTag(plan))
{
case T_IndexScan:
+ if (((IndexScan *) plan)->indexskipprefixsize > 0)
+ {
+ ExplainPropertyBool("Skip scan mode", true, es);
+ }
show_scan_qual(((IndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
if (((IndexScan *) plan)->indexqualorig)
@@ -1595,6 +1620,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->indexskipprefixsize > 0)
+ {
+ ExplainPropertyBool("Skip scan mode", true, es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 652a9afc75..d65242d7d8 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -65,6 +65,11 @@ IndexOnlyNext(IndexOnlyScanState *node)
IndexScanDesc scandesc;
TupleTableSlot *slot;
ItemPointer tid;
+ IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) node->ss.ps.plan;
+
+ /* Tells if the current position was reached via skipping; in this case
+ * there is no need for index_getnext_tid. */
+ bool skipped = false;
/*
* extract necessary information from index scan node
@@ -72,7 +77,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexonlyscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -115,14 +120,47 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ *
+ * When fetching a cursor in the direction opposite to a general scan
+ * direction, the result must be what normal fetching should have returned,
+ * but in reversed order. In other words, return the last or first scanned
+ * tuple in a DISTINCT set, depending on a cursor direction. Due to that we
+ * skip also when the first tuple wasn't emitted yet, but the directions
+ * are opposite.
+ */
+ if (node->ioss_SkipPrefixSize > 0 &&
+ (node->ioss_FirstTupleEmitted || (direction * indexonlyscan->indexorderdir < 0)))
+ {
+ if (!index_skip(scandesc, direction, indexonlyscan->indexorderdir,
+ !node->ioss_FirstTupleEmitted, node->ioss_SkipPrefixSize))
+ {
+ /* Reached end of index. At this point currPos is invalidated,
+ * and we need to reset ioss_FirstTupleEmitted, since otherwise
+ * after going backwards, reaching the end of index, and going
+ * forward again we apply skip again. It would be incorrect and
+ * lead to an extra skipped item. */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ else
+ {
+ skipped = true;
+ tid = &scandesc->xs_heaptid;
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
- while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+ while (skipped || (tid = index_getnext_tid(scandesc, direction)) != NULL)
{
bool tuple_from_heap = false;
CHECK_FOR_INTERRUPTS();
+ skipped = false;
/*
* We can skip the heap fetch if the TID references a heap page on
@@ -250,6 +288,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -500,6 +540,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index ac7aa81f67..dd3122674f 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -85,6 +85,11 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ IndexScan *indexscan = (IndexScan *) node->ss.ps.plan;
+
+ /* Tells if the current position was reached via skipping; in this case
+ * there is no need for index_getnext_tid. */
+ bool skipped = false;
/*
* extract necessary information from index scan node
@@ -92,7 +97,7 @@ IndexNext(IndexScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -116,6 +121,7 @@ IndexNext(IndexScanState *node)
node->iss_NumOrderByKeys);
node->iss_ScanDesc = scandesc;
+ node->iss_ScanDesc->xs_want_itup = true;
/*
* If no run-time keys to calculate or they are ready, go ahead and
@@ -127,12 +133,45 @@ IndexNext(IndexScanState *node)
node->iss_OrderByKeys, node->iss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ *
+ * When fetching a cursor in the direction opposite to a general scan
+ * direction, the result must be what normal fetching should have returned,
+ * but in reversed order. In other words, return the last or first scanned
+ * tuple in a DISTINCT set, depending on a cursor direction. Due to that we
+ * skip also when the first tuple wasn't emitted yet, but the directions
+ * are opposite.
+ */
+ if (node->iss_SkipPrefixSize > 0 &&
+ (node->iss_FirstTupleEmitted || (direction * indexscan->indexorderdir < 0)))
+ {
+ if (!index_skip(scandesc, direction, indexscan->indexorderdir,
+ !node->iss_FirstTupleEmitted, node->iss_SkipPrefixSize))
+ {
+ /* Reached end of index. At this point currPos is invalidated,
+ * and we need to reset iss_FirstTupleEmitted, since otherwise
+ * after going backwards, reaching the end of index, and going
+ * forward again we apply skip again. It would be incorrect and
+ * lead to an extra skipped item. */
+ node->iss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ else
+ {
+ skipped = true;
+ index_fetch_heap(scandesc, slot);
+ }
+ }
+
/*
* ok, now that we have what we need, fetch the next tuple.
*/
- while (index_getnext_slot(scandesc, direction, slot))
+ while (skipped || index_getnext_slot(scandesc, direction, slot))
{
CHECK_FOR_INTERRUPTS();
+ skipped = false;
/*
* If the index was lossy, we have to recheck the index quals using
@@ -149,6 +188,7 @@ IndexNext(IndexScanState *node)
}
}
+ node->iss_FirstTupleEmitted = true;
return slot;
}
@@ -906,6 +946,8 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexScan;
+ indexstate->iss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->iss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index a2617c7cfd..20495c9e52 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -490,6 +490,7 @@ _copyIndexScan(const IndexScan *from)
COPY_NODE_FIELD(indexorderbyorig);
COPY_NODE_FIELD(indexorderbyops);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
@@ -515,6 +516,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e6ce8e2110..2ff9625533 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -559,6 +559,7 @@ _outIndexScan(StringInfo str, const IndexScan *node)
WRITE_NODE_FIELD(indexorderbyorig);
WRITE_NODE_FIELD(indexorderbyops);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
@@ -573,6 +574,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
@@ -2213,6 +2215,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
WRITE_NODE_FIELD(group_pathkeys);
WRITE_NODE_FIELD(window_pathkeys);
WRITE_NODE_FIELD(distinct_pathkeys);
+ WRITE_NODE_FIELD(uniq_distinct_pathkeys);
WRITE_NODE_FIELD(sort_pathkeys);
WRITE_NODE_FIELD(processed_tlist);
WRITE_NODE_FIELD(minmax_aggs);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 764e3bb90c..0fc3c5ea68 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1787,6 +1787,7 @@ _readIndexScan(void)
READ_NODE_FIELD(indexorderbyorig);
READ_NODE_FIELD(indexorderbyops);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
@@ -1806,6 +1807,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c5f6593485..194e258dc1 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 2f4fea241a..70c1df47a4 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
#include "utils/lsyscache.h"
+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
RelOptInfo *partrel,
@@ -96,6 +97,30 @@ make_canonical_pathkey(PlannerInfo *root,
return pk;
}
+/*
+ * pathkey_is_unique
+ * Part of pathkey_is_redundant that is responsible for the case when the
+ * new pathkey's equivalence class is the same as that of any existing
+ * member of the pathkey list.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+ EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+ ListCell *lc;
+
+ /* If same EC already used in list, then redundant */
+ foreach(lc, pathkeys)
+ {
+ PathKey *old_pathkey = (PathKey *) lfirst(lc);
+
+ if (new_ec == old_pathkey->pk_eclass)
+ return true;
+ }
+
+ return false;
+}
+
/*
* pathkey_is_redundant
* Is a pathkey redundant with one already in the given list?
@@ -135,22 +160,12 @@ static bool
pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys)
{
EquivalenceClass *new_ec = new_pathkey->pk_eclass;
- ListCell *lc;
/* Check for EC containing a constant --- unconditionally redundant */
if (EC_MUST_BE_REDUNDANT(new_ec))
return true;
- /* If same EC already used in list, then redundant */
- foreach(lc, pathkeys)
- {
- PathKey *old_pathkey = (PathKey *) lfirst(lc);
-
- if (new_ec == old_pathkey->pk_eclass)
- return true;
- }
-
- return false;
+ return pathkey_is_unique(new_pathkey, pathkeys);
}
/*
@@ -1098,6 +1113,53 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
return pathkeys;
}
+/*
+ * make_pathkeys_for_distinctclauses
+ * Generate a pathkeys list for distinct clauses that represents the sort
+ * order specified by a list of SortGroupClauses. Similar to
+ * make_pathkeys_for_sortclauses, but allows specifying whether we need to
+ * check for full redundancy or just uniqueness.
+ */
+List *
+make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *distinctclauses,
+ List *tlist, bool checkRedundant)
+{
+ List *pathkeys = NIL;
+ ListCell *l;
+
+ foreach(l, distinctclauses)
+ {
+ SortGroupClause *sortcl = (SortGroupClause *) lfirst(l);
+ Expr *sortkey;
+ PathKey *pathkey;
+
+ sortkey = (Expr *) get_sortgroupclause_expr(sortcl, tlist);
+ Assert(OidIsValid(sortcl->sortop));
+ pathkey = make_pathkey_from_sortop(root,
+ sortkey,
+ root->nullable_baserels,
+ sortcl->sortop,
+ sortcl->nulls_first,
+ sortcl->tleSortGroupRef,
+ true);
+
+ /* Canonical form eliminates redundant ordering keys */
+ if (checkRedundant)
+ {
+ if (!pathkey_is_redundant(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ else
+ {
+ if (!pathkey_is_unique(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ }
+ return pathkeys;
+}
+
+
/****************************************************************************
* PATHKEYS AND MERGECLAUSES
****************************************************************************/
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 0c036209f0..6e54446b29 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -175,12 +175,14 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
Oid indexid, List *indexqual, List *indexqualorig,
List *indexorderby, List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2905,7 +2907,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -2916,7 +2919,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexorderbys,
indexorderbys,
indexorderbyops,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
copy_generic_path_info(&scan_plan->plan, &best_path->path);
@@ -5179,7 +5183,8 @@ make_indexscan(List *qptlist,
List *indexorderby,
List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexScan *node = makeNode(IndexScan);
Plan *plan = &node->scan.plan;
@@ -5196,6 +5201,7 @@ make_indexscan(List *qptlist,
node->indexorderbyorig = indexorderbyorig;
node->indexorderbyops = indexorderbyops;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
@@ -5208,7 +5214,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5223,6 +5230,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 9381939c82..ed52139839 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -505,6 +505,7 @@ minmax_qp_callback(PlannerInfo *root, void *extra)
root->group_pathkeys = NIL;
root->window_pathkeys = NIL;
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 17c5f086fb..688dcca4f1 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3622,12 +3622,21 @@ standard_qp_callback(PlannerInfo *root, void *extra)
if (parse->distinctClause &&
grouping_is_sortable(parse->distinctClause))
+ {
+ root->uniq_distinct_pathkeys =
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, false);
root->distinct_pathkeys =
- make_pathkeys_for_sortclauses(root,
- parse->distinctClause,
- tlist);
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, true);
+ }
else
+ {
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
+ }
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
@@ -4814,6 +4823,82 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Consider index skip scan as well */
+ if (enable_indexskipscan &&
+ IsA(path, IndexPath) &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys != NIL)
+ {
+ ListCell *lc;
+ IndexOptInfo *index = NULL;
+ bool different_columns_order = false,
+ not_empty_qual = false;
+ int i = 0;
+ int distinctPrefixKeys;
+
+ Assert(path->pathtype == T_IndexOnlyScan ||
+ path->pathtype == T_IndexScan);
+
+ index = ((IndexPath *) path)->indexinfo;
+ distinctPrefixKeys = list_length(root->uniq_distinct_pathkeys);
+
+ /*
+ * Normally we can think of distinctPrefixKeys as just the
+ * number of distinct keys. But if, let's say, we have a
+ * distinct key a, and the index contains b, a in exactly
+ * that order, then we need to use the position of a in the
+ * index as distinctPrefixKeys; otherwise skipping will
+ * happen only on the first column.
+ */
+ foreach(lc, root->uniq_distinct_pathkeys)
+ {
+ PathKey *pathKey = lfirst_node(PathKey, lc);
+ EquivalenceMember *em =
+ lfirst_node(EquivalenceMember,
+ list_head(pathKey->pk_eclass->ec_members));
+ Var *var = (Var *) em->em_expr;
+
+ Assert(i < index->ncolumns);
+
+ for (i = 0; i < index->ncolumns; i++)
+ {
+ if (index->indexkeys[i] == var->varattno)
+ {
+ distinctPrefixKeys = Max(i + 1, distinctPrefixKeys);
+ break;
+ }
+ }
+ }
+
+ /*
+ * XXX: In the case of an index scan, quals evaluation happens
+ * after ExecScanFetch, which means skip results could be
+ * filtered out. Consider the following query:
+ *
+ * select distinct (a, b) a, b, c from t where c < 100;
+ *
+ * Skip scan returns one tuple for each distinct set of (a, b)
+ * with an arbitrary c, so if the chosen c does not
+ * match the qual and there is any c that matches the qual,
+ * we miss that tuple.
+ */
+ if (path->pathtype == T_IndexScan &&
+ parse->jointree != NULL &&
+ parse->jointree->quals != NULL &&
+ list_length((List*) parse->jointree->quals) != 0)
+ not_empty_qual = true;
+
+ if (!different_columns_order && !not_empty_qual)
+ {
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ distinctPrefixKeys,
+ numDistinctRows));
+ }
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 34acb732ee..1de6ae8dcc 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2904,6 +2904,46 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *basepath,
+ int distinctPrefixKeys,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(distinctPrefixKeys > 0);
+ /*Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));*/
+ pathnode->indexskipprefix = distinctPrefixKeys;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = basepath->startup_cost;
+ pathnode->path.total_cost = basepath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index cf1761401d..34fbc27716 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -272,6 +272,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 90ffd89339..9e5b74b6de 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -910,6 +910,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0fc23e3a61..88f9890780 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6e3db06eed..f84791e358 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -130,6 +130,13 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir,
+ ScanDirection indexdir,
+ bool start,
+ int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -225,6 +232,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 8c053be2ca..e5ec5b07c8 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -173,6 +173,8 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool start, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 52eafe6b00..6d441a4696 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -662,6 +662,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTScanInsert skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -776,6 +779,8 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool start, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -800,6 +805,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool start, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f42189d2bf..4d2e994695 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1377,6 +1377,8 @@ typedef struct IndexScanState
ExprContext *iss_RuntimeContext;
Relation iss_RelationDesc;
struct IndexScanDescData *iss_ScanDesc;
+ int iss_SkipPrefixSize;
+ bool iss_FirstTupleEmitted;
/* These are needed for re-checking ORDER BY expr ordering */
pairingheap *iss_ReorderQueue;
@@ -1406,6 +1408,8 @@ typedef struct IndexScanState
* TableSlot slot for holding tuples fetched from the table
* VMBuffer buffer in use for visibility map testing, if any
* PscanLen size of parallel index-only scan descriptor
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted whether the first tuple has been emitted
* ----------------
*/
typedef struct IndexOnlyScanState
@@ -1424,6 +1428,8 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
TupleTableSlot *ioss_TableSlot;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 23a06d718e..ff11c17cca 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -298,6 +298,11 @@ struct PlannerInfo
List *group_pathkeys; /* groupClause pathkeys, if any */
List *window_pathkeys; /* pathkeys of bottom window, if any */
List *distinct_pathkeys; /* distinctClause pathkeys, if any */
+ List *uniq_distinct_pathkeys; /* unique, but potentially redundant
+ distinctClause pathkeys, if any.
+ Used for index skip scan, since
+ redundant distinctClauses also must
+ be considered */
List *sort_pathkeys; /* sortClause pathkeys, if any */
List *part_schemes; /* Canonicalised partition schemes used in the
@@ -833,6 +838,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1169,6 +1175,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1181,6 +1190,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 8e6594e355..04e871ae83 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -405,6 +405,7 @@ typedef struct IndexScan
List *indexorderbyorig; /* the same in original form */
List *indexorderbyops; /* OIDs of sort ops for ORDER BY exprs */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct scans */
} IndexScan;
/* ----------------
@@ -432,6 +433,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b3d0b4f6fb..9abfdfb6bd 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index a12af54971..7edcf4e689 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -200,6 +200,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137d1d..a782d12a50 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -209,6 +209,10 @@ extern List *build_join_pathkeys(PlannerInfo *root,
extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
List *sortclauses,
List *tlist);
+extern List *make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist,
+ bool checkRedundant);
extern void initialize_mergeclause_eclasses(PlannerInfo *root,
RestrictInfo *restrictinfo);
extern void update_mergeclause_eclasses(PlannerInfo *root,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index c6d575a2f9..4f5c82f49d 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..f0e92a99dd 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,514 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+SELECT DISTINCT four FROM tenk1 ORDER BY four DESC;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, hundred, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) hundred
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+ a | b
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+FETCH FROM c;
+ a | b
+---+---
+ 1 | 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+END;
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+FETCH FROM c;
+ a | b
+---+-------
+ 5 | 10000
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+END;
+DROP TABLE distinct_a;
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+ QUERY PLAN
+--------------------------------------------------------------
+ Index Only Scan using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan mode: true
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 1 | 2
+ 3 | 1 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 1 | 2
+ 1 | 1 | 2
+(2 rows)
+
+END;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Index Only Scan Backward using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan mode: true
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 2 | 2
+ 1 | 2 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 2 | 2
+ 3 | 2 | 2
+(2 rows)
+
+END;
+DROP TABLE distinct_abc;
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ four | ten
+------+-----
+ 0 | 0
+ 1 | 9
+ 2 | 0
+ 3 | 1
+(4 rows)
+
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ four | ten
+------+-----
+ 1 | 9
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ QUERY PLAN
+--------------------------------------
+ Index Scan using tenk1_four on tenk1
+ Skip scan mode: true
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ QUERY PLAN
+---------------------------------------------------
+ Result
+ -> Unique
+ -> Bitmap Heap Scan on tenk1
+ Recheck Cond: (four = 1)
+ -> Bitmap Index Scan on tenk1_four
+ Index Cond: (four = 1)
+(6 rows)
+
+-- check columns order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ four | ten
+------+-----
+ 0 | 0
+ 0 | 2
+ 0 | 4
+ 0 | 6
+ 0 | 8
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Skip scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Skip scan mode: true
+ Index Cond: (four = 0)
+(3 rows)
+
+DROP INDEX tenk1_four_ten;
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ four | ten
+------+-----
+ 0 | 2
+ 2 | 2
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_ten_four on tenk1
+ Skip scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_ten_four on tenk1
+ Skip scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+DROP INDEX tenk1_ten_four;
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+ four | four
+------+------
+ 0 | 0
+ 2 | 2
+(2 rows)
+
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+ four | ?column?
+------+----------
+ 2 | 1
+ 0 | 1
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+-------------------------------------------
+ Index Only Scan using tenk1_four on tenk1
+ Skip scan mode: true
+(2 rows)
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+FETCH FROM c;
+ four
+------
+ 0
+(1 row)
+
+FETCH BACKWARD FROM c;
+ four
+------
+(0 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+END;
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 9999
+ 1 | 10000
+(5 rows)
+
+DROP TABLE distinct_visibility;
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+ SELECT a, b::int2 b, (b % 2)::int2 c FROM
+ generate_series(1, 5) a,
+ generate_series(1,366) b;
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Index Only Scan using distinct_boundaries_a_b_c_idx on distinct_boundaries
+ Skip scan mode: true
+ Index Cond: ((b >= 1) AND (c = 0))
+(3 rows)
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ a | b | c
+---+---+---
+ 1 | 2 | 0
+ 2 | 2 | 0
+ 3 | 2 | 0
+ 4 | 2 | 0
+ 5 | 2 | 0
+(5 rows)
+
+DROP TABLE distinct_boundaries;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index f96bebf410..a3be42a725 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..fddd0256ff 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,199 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+SELECT DISTINCT four FROM tenk1 ORDER BY four DESC;
+
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, hundred, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) hundred
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+DROP TABLE distinct_a;
+
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+DROP TABLE distinct_abc;
+
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+-- check columns order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+DROP INDEX tenk1_four_ten;
+
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+DROP INDEX tenk1_ten_four;
+
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+END;
+
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DROP TABLE distinct_visibility;
+
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+ SELECT a, b::int2 b, (b % 2)::int2 c FROM
+ generate_series(1, 5) a,
+ generate_series(1,366) b;
+
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+DROP TABLE distinct_boundaries;
--
2.21.0
On 2019-Sep-22, Dmitry Dolgov wrote:
> > I think multiplying two ScanDirections to watch for a negative result is
> > pretty ugly:
>
> Probably, but the only alternative I see to check if directions are opposite is
> to check that directions come in pairs (back, forth), (forth, back). Is there
> an easier way?

Maybe use the ^ operator?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Sun, 22 Sep 2019 23:02:04 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20190923020204.GA2781@alvherre.pgsql>
> On 2019-Sep-22, Dmitry Dolgov wrote:
>
> > > I think multiplying two ScanDirections to watch for a negative result is
> > > pretty ugly:
> >
> > Probably, but the only alternative I see to check if directions are opposite is
> > to check that directions come in pairs (back, forth), (forth, back). Is there
> > an easier way?
>
> Maybe use the ^ operator?
It's not a logical operator but a bitwise arithmetic operator,
which cannot be used unless the operands are guaranteed to be 0 or 1
(as integers). In a kind-of-standard but hacky way, "(!a != !b)"
works as desired, since ! is a logical operator.

Couldn't we use (a && !b) || (!a && b) instead? The compiler will
optimize it well either way.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 24 Sep 2019 17:35:47 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in <20190924.173547.226622711.horikyota.ntt@gmail.com>
> At Sun, 22 Sep 2019 23:02:04 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20190923020204.GA2781@alvherre.pgsql>
> > On 2019-Sep-22, Dmitry Dolgov wrote:
> >
> > > > I think multiplying two ScanDirections to watch for a negative result is
> > > > pretty ugly:
> > >
> > > Probably, but the only alternative I see to check if directions are opposite is
> > > to check that directions come in pairs (back, forth), (forth, back). Is there
> > > an easier way?
> >
> > Maybe use the ^ operator?
>
> It's not a logical operator but a bitwise arithmetic operator,
> which cannot be used unless the operands are guaranteed to be 0 or 1
> (as integers). In a kind-of-standard but hacky way, "(!a != !b)"
> works as desired, since ! is a logical operator.
>
> Couldn't we use (a && !b) || (!a && b) instead? The compiler will
> optimize it well either way.
Sorry, it's not a boolean but a tristate value. From the definition
(Backward, NoMovement, Forward) = (-1, 0, 1), (dir1 == -dir2) would work
if NoMovement did not exist. Since that is not guaranteed, how about
(dir1 != 0 && dir1 == -dir2)?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2019-Sep-24, Kyotaro Horiguchi wrote:
> Sorry, it's not a boolean but a tristate value. From the definition
> (Backward, NoMovement, Forward) = (-1, 0, 1), (dir1 == -dir2) would work
> if NoMovement did not exist. Since that is not guaranteed, how about
> (dir1 != 0 && dir1 == -dir2)?
Maybe just add ScanDirectionIsOpposite(dir1, dir2) with that
definition? :-)
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Tue, 24 Sep 2019 09:06:27 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20190924120627.GA12454@alvherre.pgsql>
> On 2019-Sep-24, Kyotaro Horiguchi wrote:
>
> > Sorry, it's not a boolean but a tristate value. From the definition
> > (Backward, NoMovement, Forward) = (-1, 0, 1), (dir1 == -dir2) would work
> > if NoMovement did not exist. Since that is not guaranteed, how about
> > (dir1 != 0 && dir1 == -dir2)?
>
> Maybe just add ScanDirectionIsOpposite(dir1, dir2) with that
> definition? :-)
Yeah, sounds good to establish it as a part of ScanDirection's
definition.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Sep 25, 2019 at 3:03 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> At Tue, 24 Sep 2019 09:06:27 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20190924120627.GA12454@alvherre.pgsql>
> > On 2019-Sep-24, Kyotaro Horiguchi wrote:
> >
> > > Sorry, it's not a boolean but a tristate value. From the definition
> > > (Backward, NoMovement, Forward) = (-1, 0, 1), (dir1 == -dir2) would work
> > > if NoMovement did not exist. Since that is not guaranteed, how about
> > > (dir1 != 0 && dir1 == -dir2)?
> >
> > Maybe just add ScanDirectionIsOpposite(dir1, dir2) with that
> > definition? :-)
>
> Yeah, sounds good to establish it as a part of ScanDirection's
> definition.
Yep, this way looks better.
Attachments:
v27-0001-Index-skip-scan.patch (application/octet-stream)
From dad364f8fa8b62661bce522ed7a3be3f0a3d4bbe Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Wed, 3 Jul 2019 16:25:20 +0200
Subject: [PATCH v27] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan and IndexScan. To make it suitable both for
situations with a small number of distinct values and with a significant
number of distinct values, the following approach is taken: instead of
searching from the root for every value, we first search on the current
page, and only if the value is not found there do we continue the search
from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
Author: Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Floris Van Nee
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 15 +
doc/src/sgml/indexam.sgml | 63 +++
doc/src/sgml/indices.sgml | 24 +
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 18 +
src/backend/access/nbtree/nbtree.c | 13 +
src/backend/access/nbtree/nbtsearch.c | 354 ++++++++++++
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 29 +
src/backend/executor/nodeIndexonlyscan.c | 47 +-
src/backend/executor/nodeIndexscan.c | 47 +-
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 3 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/path/pathkeys.c | 84 ++-
src/backend/optimizer/plan/createplan.c | 20 +-
src/backend/optimizer/plan/planagg.c | 1 +
src/backend/optimizer/plan/planner.c | 91 +++-
src/backend/optimizer/util/pathnode.c | 40 ++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 8 +
src/include/access/genam.h | 2 +
src/include/access/nbtree.h | 7 +
src/include/access/sdir.h | 7 +
src/include/nodes/execnodes.h | 6 +
src/include/nodes/pathnodes.h | 10 +
src/include/nodes/plannodes.h | 2 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/include/optimizer/paths.h | 4 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 511 ++++++++++++++++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 196 +++++++
42 files changed, 1611 insertions(+), 25 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index cc1670934f..ab9f0a7177 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 89284dc5c0..3edd12dd27 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4413,6 +4413,21 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). The default is
+ <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index dd54c68802..73b1b4fcf7 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -144,6 +144,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -687,6 +688,68 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan,
+ ScanDirection direction,
+ ScanDirection indexdir,
+ bool scanstart,
+ int prefix);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan. The arguments are:
+
+ <variablelist>
+ <varlistentry>
+ <term><parameter>scan</parameter></term>
+ <listitem>
+ <para>
+ Index scan information
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>direction</parameter></term>
+ <listitem>
+ <para>
+ The direction in which data is advancing.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>indexdir</parameter></term>
+ <listitem>
+ <para>
+ The direction in which the index itself must be read.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>scanstart</parameter></term>
+ <listitem>
+ <para>
+ Whether or not this is the start of the scan.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>prefix</parameter></term>
+ <listitem>
+ <para>
+ Distinct prefix size.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 95c0a1926c..567141046f 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1235,6 +1235,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When an index scan is used to retrieve the distinct values of a column,
+ it can be inefficient, since it has to scan all the equal values of a
+ key. In such cases the planner will consider applying the index skip
+ scan approach, which is based on the idea of the
+ <firstterm>Loose index scan</firstterm>. Rather than scanning all equal
+ values of a key, as soon as a new value is found, it will search for a
+ larger value on the same index page, and if none is found, restart the
+ search by looking for a larger value. This is much faster when the index
+ has many equal keys.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ae7b729edd..233ea9e5ec 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -109,6 +109,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cf9699ad18..9817f34c34 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0cc87911d6..38072ad24b 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 5cc30dac42..019e330cff 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -82,6 +82,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 28edd4aca7..ae7a882571 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -730,6 +731,23 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool scanstart, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction,
+ indexdir, scanstart, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..46471598d1 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -131,6 +131,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -380,6 +381,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -447,6 +450,16 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool start, int prefix)
+{
+ return _bt_skip(scan, direction, indexdir, start, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7f77ed24c5..b4e6b7555b 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -37,6 +37,10 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+static inline void _bt_update_skip_scankeys(IndexScanDesc scan,
+ Relation indexRel);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir);
/*
@@ -1373,6 +1377,307 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple.
+ *
+ * The current position is set so that a subsequent call to _bt_next will
+ * fetch the first tuple that differs in the leading 'prefix' keys.
+ *
+ * There are four different kinds of skipping (depending on dir and
+ * indexdir) that are important to distinguish, especially in the presence
+ * of an index condition:
+ *
+ * * Advancing forward and reading forward
+ * simple scan
+ *
+ * * Advancing forward and reading backward
+ * scan inside a cursor fetching backward, when skipping is necessary
+ * right from the start
+ *
+ * * Advancing backward and reading forward
+ * scan with order by desc inside a cursor fetching forward, when
+ * skipping is necessary right from the start
+ *
+ * * Advancing backward and reading backward
+ * simple scan with order by desc
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool scanstart, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Relation indexRel = scan->indexRelation;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements
+ * in order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ so->skipScanKey->keysz = prefix;
+ so->skipScanKey->scantid = NULL;
+ }
+ else
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos) &&
+ _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {
+ bool keyFound = false;
+
+ LockBuffer(so->currPos.buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, so->currPos.buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(so->currPos.buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found the scan key within the current page, so let's scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+ * number
+ */
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ /*
+ * Simplest case is when both directions are forward, when we are already
+ * at the next distinct key at the beginning of the series (so everything
+ * else would be done in _bt_readpage)
+ *
+ * The case when both directions are backward is also simple, but we need
+ * to go one step back, since we need the last element from the previous
+ * series.
+ */
+ if (ScanDirectionIsBackward(dir) && ScanDirectionIsBackward(indexdir))
+ offnum = OffsetNumberPrev(offnum);
+
+ /*
+ * Advance backward but read forward. At this moment we are at the next
+ * distinct key at the beginning of the series. If the scan has just
+ * started, we can read forward without doing anything else. Otherwise find
+ * the previous distinct key and the beginning of its series and read
+ * forward from there. To do so, go back one step, perform a binary search
+ * to find the first item in the series and let _bt_readpage do everything
+ * else.
+ */
+ else if (ScanDirectionIsBackward(dir) && ScanDirectionIsForward(indexdir))
+ {
+ if (!scanstart)
+ {
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* One step back to find a previous value */
+ _bt_readpage(scan, dir, offnum);
+ if (_bt_next(scan, dir))
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ /*
+ * And now find the last item from the sequence for the current
+ * value, with the intention of doing OffsetNumberNext. As a result
+ * we end up on the first element from the sequence.
+ */
+ if (_bt_scankey_within_page(scan, so->skipScanKey,
+ so->currPos.buf, dir))
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ else
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ }
+ else
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ }
+
+ /*
+ * Advance forward but read backward. At this moment we are at the next
+ * distinct key at the beginning of the series. If the scan has just
+ * started, we can go one step back and read forward without doing anything
+ * else. Otherwise find the next distinct key and the beginning of its
+ * series, go one step back and read backward from there.
+ *
+ * An interesting situation can happen if one of the distinct keys does not
+ * pass the corresponding index condition at all. In this case reading
+ * backward can lead to a previous distinct key being found, creating a
+ * loop. To avoid that, check the value to be returned, and jump one more
+ * time if it's the same as at the beginning.
+ */
+ else if (ScanDirectionIsForward(dir) && ScanDirectionIsBackward(indexdir))
+ {
+ if (scanstart)
+ offnum = OffsetNumberPrev(offnum);
+ else
+ {
+ OffsetNumber nextOffset, startOffset;
+ nextOffset = startOffset = ItemPointerGetOffsetNumber(&scan->xs_itup->t_tid);
+
+ while(nextOffset == startOffset)
+ {
+ /*
+ * Find the next index tuple to update the scan key. It could be at
+ * the end, so check for the max offset
+ */
+ OffsetNumber curOffnum = offnum;
+ Page page = BufferGetPage(so->currPos.buf);
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ ItemId itemid = PageGetItemId(page, Min(offnum, maxoff));
+
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ scan->xs_itup = (IndexTuple) PageGetItem(page, itemid);
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /*
+ * The jump to the next key returned the same offset, which means
+ * we are at the end and need to return
+ */
+ if (offnum == curOffnum)
+ {
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ BTScanPosUnpinIfPinned(so->currPos);
+ BTScanPosInvalidate(so->currPos);
+
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+
+ /* Check if _bt_readpage returns already found item */
+ if (_bt_readpage(scan, indexdir, offnum))
+ {
+ IndexTuple itup;
+
+ currItem = &so->currPos.items[so->currPos.lastItem];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
+ }
+ else
+ {
+ elog(ERROR, "could not read closest index tuples: %d", offnum);
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+
+ /*
+ * If nextOffset is the same as before, it means we are in a
+ * loop; return offnum to the original position and jump
+ * further
+ */
+ if (nextOffset == startOffset)
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, indexdir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
@@ -2244,3 +2549,52 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
so->numKilled = 0; /* just paranoia */
so->markItemIndex = -1; /* ditto */
}
+
+/*
+ * _bt_update_skip_scankeys() -- set up new values for the existing scankeys
+ * based on the current index tuple
+ */
+static inline void
+_bt_update_skip_scankeys(IndexScanDesc scan, Relation indexRel)
+{
+ TupleDesc itupdesc;
+ int indnkeyatts, i;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey scankeys = so->skipScanKey->scankeys;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+}
+
+/*
+ * _bt_scankey_within_page() -- check if the provided scankey could be found
+ * within the page specified by the buffer.
+ */
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir)
+{
+ OffsetNumber low, high, compare_offset;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ return _bt_compare(scan->indexRelation,
+ key, page, compare_offset) > compare_value;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 45472db147..dc151ecf09 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 62fb3434a3..ad500de12b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -130,6 +130,7 @@ static void ExplainDummyGroup(const char *objtype, const char *labelname,
static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
static void ExplainJSONLineEnding(ExplainState *es);
static void ExplainYAMLLineStarting(ExplainState *es);
+static void ExplainIndexSkipScanKeys(ExplainState *es, int skipPrefixSize);
static void escape_yaml(StringInfo buf, const char *str);
@@ -1041,6 +1042,22 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
return planstate_tree_walker(planstate, ExplainPreScanNode, rels_used);
}
+/*
+ * ExplainIndexSkipScanKeys -
+ * Append information about index skip scan to es->str.
+ *
+ * Can be used to print the skip prefix size.
+ */
+static void
+ExplainIndexSkipScanKeys(ExplainState *es, int skipPrefixSize)
+{
+ if (skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL, skipPrefixSize, es);
+ }
+}
+
/*
* ExplainNode -
* Appends a description of a plan tree to es->str
@@ -1363,6 +1380,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexScan *indexscan = (IndexScan *) plan;
+ ExplainIndexSkipScanKeys(es, indexscan->indexskipprefixsize);
+
ExplainIndexScanDetails(indexscan->indexid,
indexscan->indexorderdir,
es);
@@ -1373,6 +1392,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ ExplainIndexSkipScanKeys(es, indexonlyscan->indexskipprefixsize);
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1582,6 +1603,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
switch (nodeTag(plan))
{
case T_IndexScan:
+ if (((IndexScan *) plan)->indexskipprefixsize > 0)
+ {
+ ExplainPropertyBool("Skip scan mode", true, es);
+ }
show_scan_qual(((IndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
if (((IndexScan *) plan)->indexqualorig)
@@ -1595,6 +1620,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->indexskipprefixsize > 0)
+ {
+ ExplainPropertyBool("Skip scan mode", true, es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 652a9afc75..80c0e23383 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -65,6 +65,11 @@ IndexOnlyNext(IndexOnlyScanState *node)
IndexScanDesc scandesc;
TupleTableSlot *slot;
ItemPointer tid;
+ IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) node->ss.ps.plan;
+
+ /*
+ * Tells if the current position was reached via skipping. In this case
+ * there is no need for index_getnext_tid.
+ */
+ bool skipped = false;
/*
* extract necessary information from index scan node
@@ -72,7 +77,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexonlyscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -115,14 +120,48 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ *
+ * When fetching a cursor in the direction opposite to a general scan
+ * direction, the result must be what normal fetching should have returned,
+ * but in reversed order. In other words, return the last or first scanned
+ * tuple in a DISTINCT set, depending on the cursor direction. Because of
+ * that, we also skip when the first tuple hasn't been emitted yet but the
+ * directions are opposite.
+ */
+ if (node->ioss_SkipPrefixSize > 0 &&
+ (node->ioss_FirstTupleEmitted ||
+ ScanDirectionsAreOpposite(direction, indexonlyscan->indexorderdir)))
+ {
+ if (!index_skip(scandesc, direction, indexonlyscan->indexorderdir,
+ !node->ioss_FirstTupleEmitted, node->ioss_SkipPrefixSize))
+ {
+ /*
+ * Reached the end of the index. At this point currPos is
+ * invalidated, and we need to reset ioss_FirstTupleEmitted:
+ * otherwise, after going backwards, reaching the end of the
+ * index, and going forward again, we would apply the skip again,
+ * which would be incorrect and lead to an extra skipped item.
+ */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ else
+ {
+ skipped = true;
+ tid = &scandesc->xs_heaptid;
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
- while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+ while (skipped || (tid = index_getnext_tid(scandesc, direction)) != NULL)
{
bool tuple_from_heap = false;
CHECK_FOR_INTERRUPTS();
+ skipped = false;
/*
* We can skip the heap fetch if the TID references a heap page on
@@ -250,6 +289,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -500,6 +541,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index ac7aa81f67..532b6182d8 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -85,6 +85,11 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ IndexScan *indexscan = (IndexScan *) node->ss.ps.plan;
+
+ /*
+ * Tells if the current position was reached via skipping. In this case
+ * there is no need for index_getnext_slot.
+ */
+ bool skipped = false;
/*
* extract necessary information from index scan node
@@ -92,7 +97,7 @@ IndexNext(IndexScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -116,6 +121,7 @@ IndexNext(IndexScanState *node)
node->iss_NumOrderByKeys);
node->iss_ScanDesc = scandesc;
+ node->iss_ScanDesc->xs_want_itup = true;
/*
* If no run-time keys to calculate or they are ready, go ahead and
@@ -127,12 +133,46 @@ IndexNext(IndexScanState *node)
node->iss_OrderByKeys, node->iss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ *
+ * When fetching a cursor in the direction opposite to a general scan
+ * direction, the result must be what normal fetching should have returned,
+ * but in reversed order. In other words, return the last or first scanned
+ * tuple in a DISTINCT set, depending on the cursor direction. Because of
+ * that, we also skip when the first tuple hasn't been emitted yet but the
+ * directions are opposite.
+ */
+ if (node->iss_SkipPrefixSize > 0 &&
+ (node->iss_FirstTupleEmitted ||
+ ScanDirectionsAreOpposite(direction, indexscan->indexorderdir)))
+ {
+ if (!index_skip(scandesc, direction, indexscan->indexorderdir,
+ !node->iss_FirstTupleEmitted, node->iss_SkipPrefixSize))
+ {
+ /*
+ * Reached the end of the index. At this point currPos is
+ * invalidated, and we need to reset iss_FirstTupleEmitted:
+ * otherwise, after going backwards, reaching the end of the
+ * index, and going forward again, we would apply the skip
+ * again, which would be incorrect and lead to an extra
+ * skipped item.
+ */
+ node->iss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ else
+ {
+ skipped = true;
+ index_fetch_heap(scandesc, slot);
+ }
+ }
+
/*
* ok, now that we have what we need, fetch the next tuple.
*/
- while (index_getnext_slot(scandesc, direction, slot))
+ while (skipped || index_getnext_slot(scandesc, direction, slot))
{
CHECK_FOR_INTERRUPTS();
+ skipped = false;
/*
* If the index was lossy, we have to recheck the index quals using
@@ -149,6 +189,7 @@ IndexNext(IndexScanState *node)
}
}
+ node->iss_FirstTupleEmitted = true;
return slot;
}
@@ -906,6 +947,8 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexScan;
+ indexstate->iss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->iss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index a2617c7cfd..20495c9e52 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -490,6 +490,7 @@ _copyIndexScan(const IndexScan *from)
COPY_NODE_FIELD(indexorderbyorig);
COPY_NODE_FIELD(indexorderbyops);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
@@ -515,6 +516,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e6ce8e2110..2ff9625533 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -559,6 +559,7 @@ _outIndexScan(StringInfo str, const IndexScan *node)
WRITE_NODE_FIELD(indexorderbyorig);
WRITE_NODE_FIELD(indexorderbyops);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
@@ -573,6 +574,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
@@ -2213,6 +2215,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
WRITE_NODE_FIELD(group_pathkeys);
WRITE_NODE_FIELD(window_pathkeys);
WRITE_NODE_FIELD(distinct_pathkeys);
+ WRITE_NODE_FIELD(uniq_distinct_pathkeys);
WRITE_NODE_FIELD(sort_pathkeys);
WRITE_NODE_FIELD(processed_tlist);
WRITE_NODE_FIELD(minmax_aggs);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 764e3bb90c..0fc3c5ea68 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1787,6 +1787,7 @@ _readIndexScan(void)
READ_NODE_FIELD(indexorderbyorig);
READ_NODE_FIELD(indexorderbyops);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
@@ -1806,6 +1807,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c5f6593485..194e258dc1 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 2f4fea241a..70c1df47a4 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
#include "utils/lsyscache.h"
+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
RelOptInfo *partrel,
@@ -96,6 +97,30 @@ make_canonical_pathkey(PlannerInfo *root,
return pk;
}
+/*
+ * pathkey_is_unique
+ * The part of pathkey_is_redundant that is responsible for the case when
+ * the new pathkey's equivalence class is the same as that of an existing
+ * member of the pathkey list.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+ EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+ ListCell *lc;
+
+ /* If same EC already used in list, then redundant */
+ foreach(lc, pathkeys)
+ {
+ PathKey *old_pathkey = (PathKey *) lfirst(lc);
+
+ if (new_ec == old_pathkey->pk_eclass)
+ return true;
+ }
+
+ return false;
+}
+
/*
* pathkey_is_redundant
* Is a pathkey redundant with one already in the given list?
@@ -135,22 +160,12 @@ static bool
pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys)
{
EquivalenceClass *new_ec = new_pathkey->pk_eclass;
- ListCell *lc;
/* Check for EC containing a constant --- unconditionally redundant */
if (EC_MUST_BE_REDUNDANT(new_ec))
return true;
- /* If same EC already used in list, then redundant */
- foreach(lc, pathkeys)
- {
- PathKey *old_pathkey = (PathKey *) lfirst(lc);
-
- if (new_ec == old_pathkey->pk_eclass)
- return true;
- }
-
- return false;
+ return pathkey_is_unique(new_pathkey, pathkeys);
}
/*
@@ -1098,6 +1113,53 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
return pathkeys;
}
+/*
+ * make_pathkeys_for_distinctclauses
+ * Generate a pathkeys list for distinct clauses that represents the sort
+ * order specified by a list of SortGroupClauses. Similar to
+ * make_pathkeys_for_sortclauses, but allows specifying whether to check
+ * for full redundancy or just for uniqueness.
+ */
+List *
+make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *distinctclauses,
+ List *tlist, bool checkRedundant)
+{
+ List *pathkeys = NIL;
+ ListCell *l;
+
+ foreach(l, distinctclauses)
+ {
+ SortGroupClause *sortcl = (SortGroupClause *) lfirst(l);
+ Expr *sortkey;
+ PathKey *pathkey;
+
+ sortkey = (Expr *) get_sortgroupclause_expr(sortcl, tlist);
+ Assert(OidIsValid(sortcl->sortop));
+ pathkey = make_pathkey_from_sortop(root,
+ sortkey,
+ root->nullable_baserels,
+ sortcl->sortop,
+ sortcl->nulls_first,
+ sortcl->tleSortGroupRef,
+ true);
+
+ /* Canonical form eliminates redundant ordering keys */
+ if (checkRedundant)
+ {
+ if (!pathkey_is_redundant(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ else
+ {
+ if (!pathkey_is_unique(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ }
+ return pathkeys;
+}
+
+
/****************************************************************************
* PATHKEYS AND MERGECLAUSES
****************************************************************************/
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 0c036209f0..6e54446b29 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -175,12 +175,14 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
Oid indexid, List *indexqual, List *indexqualorig,
List *indexorderby, List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2905,7 +2907,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -2916,7 +2919,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexorderbys,
indexorderbys,
indexorderbyops,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
copy_generic_path_info(&scan_plan->plan, &best_path->path);
@@ -5179,7 +5183,8 @@ make_indexscan(List *qptlist,
List *indexorderby,
List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexScan *node = makeNode(IndexScan);
Plan *plan = &node->scan.plan;
@@ -5196,6 +5201,7 @@ make_indexscan(List *qptlist,
node->indexorderbyorig = indexorderbyorig;
node->indexorderbyops = indexorderbyops;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
@@ -5208,7 +5214,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5223,6 +5230,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 9381939c82..ed52139839 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -505,6 +505,7 @@ minmax_qp_callback(PlannerInfo *root, void *extra)
root->group_pathkeys = NIL;
root->window_pathkeys = NIL;
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 17c5f086fb..688dcca4f1 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3622,12 +3622,21 @@ standard_qp_callback(PlannerInfo *root, void *extra)
if (parse->distinctClause &&
grouping_is_sortable(parse->distinctClause))
+ {
+ root->uniq_distinct_pathkeys =
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, false);
root->distinct_pathkeys =
- make_pathkeys_for_sortclauses(root,
- parse->distinctClause,
- tlist);
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, true);
+ }
else
+ {
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
+ }
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
@@ -4814,6 +4823,82 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Consider index skip scan as well */
+ if (enable_indexskipscan &&
+ IsA(path, IndexPath) &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys != NIL)
+ {
+ ListCell *lc;
+ IndexOptInfo *index = NULL;
+ bool different_columns_order = false,
+ not_empty_qual = false;
+ int i = 0;
+ int distinctPrefixKeys;
+
+ Assert(path->pathtype == T_IndexOnlyScan ||
+ path->pathtype == T_IndexScan);
+
+ index = ((IndexPath *) path)->indexinfo;
+ distinctPrefixKeys = list_length(root->uniq_distinct_pathkeys);
+
+ /*
+ * Normally distinctPrefixKeys is just the number of distinct
+ * keys. But suppose we have a distinct key a, while the index
+ * contains (b, a) in exactly this order. In that case we must
+ * use the position of a in the index as distinctPrefixKeys;
+ * otherwise skipping would happen only over the first column.
+ */
+ foreach(lc, root->uniq_distinct_pathkeys)
+ {
+ PathKey *pathKey = lfirst_node(PathKey, lc);
+ EquivalenceMember *em =
+ lfirst_node(EquivalenceMember,
+ list_head(pathKey->pk_eclass->ec_members));
+ Var *var = (Var *) em->em_expr;
+
+ Assert(i < index->ncolumns);
+
+ for (i = 0; i < index->ncolumns; i++)
+ {
+ if (index->indexkeys[i] == var->varattno)
+ {
+ distinctPrefixKeys = Max(i + 1, distinctPrefixKeys);
+ break;
+ }
+ }
+ }
+
+ /*
+ * XXX: In case of an index scan, quals evaluation happens
+ * after ExecScanFetch, which means skipped results could be
+ * filtered out. Consider the following query:
+ *
+ * select distinct on (a, b) a, b, c from t where c < 100;
+ *
+ * Skip scan returns one tuple for each distinct set of (a, b)
+ * with an arbitrary value of c, so if the chosen c does not
+ * match the qual while some other c does, we miss that tuple.
+ */
+ if (path->pathtype == T_IndexScan &&
+ parse->jointree != NULL &&
+ parse->jointree->quals != NULL &&
+ list_length((List*) parse->jointree->quals) != 0)
+ not_empty_qual = true;
+
+ if (!different_columns_order && !not_empty_qual)
+ {
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ distinctPrefixKeys,
+ numDistinctRows));
+ }
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 34acb732ee..1de6ae8dcc 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2904,6 +2904,46 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode that is the same as an existing IndexPath, except
+ * that it skips duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *basepath,
+ int distinctPrefixKeys,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(distinctPrefixKeys > 0);
+ /*Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));*/
+ pathnode->indexskipprefix = distinctPrefixKeys;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = basepath->startup_cost;
+ pathnode->path.total_cost = basepath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index cf1761401d..34fbc27716 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -272,6 +272,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 90ffd89339..9e5b74b6de 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -910,6 +910,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0fc23e3a61..88f9890780 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6e3db06eed..f84791e358 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -130,6 +130,13 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir,
+ ScanDirection indexdir,
+ bool start,
+ int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -225,6 +232,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 8c053be2ca..e5ec5b07c8 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -173,6 +173,8 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool start, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 52eafe6b00..6d441a4696 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -662,6 +662,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTScanInsert skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -776,6 +779,8 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool start, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -800,6 +805,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool start, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/access/sdir.h b/src/include/access/sdir.h
index 664e72ef5d..dff90fada1 100644
--- a/src/include/access/sdir.h
+++ b/src/include/access/sdir.h
@@ -55,4 +55,11 @@ typedef enum ScanDirection
#define ScanDirectionIsForward(direction) \
((bool) ((direction) == ForwardScanDirection))
+/*
+ * ScanDirectionsAreOpposite
+ * True iff scan directions are backward/forward or forward/backward.
+ */
+#define ScanDirectionsAreOpposite(dirA, dirB) \
+	((bool) ((dirA) != NoMovementScanDirection && (dirA) == -(dirB)))
+
#endif /* SDIR_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f42189d2bf..4d2e994695 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1377,6 +1377,8 @@ typedef struct IndexScanState
ExprContext *iss_RuntimeContext;
Relation iss_RelationDesc;
struct IndexScanDescData *iss_ScanDesc;
+ int iss_SkipPrefixSize;
+ bool iss_FirstTupleEmitted;
/* These are needed for re-checking ORDER BY expr ordering */
pairingheap *iss_ReorderQueue;
@@ -1406,6 +1408,8 @@ typedef struct IndexScanState
* TableSlot slot for holding tuples fetched from the table
* VMBuffer buffer in use for visibility map testing, if any
* PscanLen size of parallel index-only scan descriptor
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ----------------
*/
typedef struct IndexOnlyScanState
@@ -1424,6 +1428,8 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
TupleTableSlot *ioss_TableSlot;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 23a06d718e..ff11c17cca 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -298,6 +298,11 @@ struct PlannerInfo
List *group_pathkeys; /* groupClause pathkeys, if any */
List *window_pathkeys; /* pathkeys of bottom window, if any */
List *distinct_pathkeys; /* distinctClause pathkeys, if any */
+	List	   *uniq_distinct_pathkeys; /* deduplicated, but possibly still
+										 * redundant, distinctClause pathkeys,
+										 * if any. Used for index skip scan,
+										 * since redundant distinctClauses
+										 * must also be considered */
List *sort_pathkeys; /* sortClause pathkeys, if any */
List *part_schemes; /* Canonicalised partition schemes used in the
@@ -833,6 +838,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1169,6 +1175,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1181,6 +1190,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 8e6594e355..04e871ae83 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -405,6 +405,7 @@ typedef struct IndexScan
List *indexorderbyorig; /* the same in original form */
List *indexorderbyops; /* OIDs of sort ops for ORDER BY exprs */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct scans */
} IndexScan;
/* ----------------
@@ -432,6 +433,7 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b3d0b4f6fb..9abfdfb6bd 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index a12af54971..7edcf4e689 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -200,6 +200,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137d1d..a782d12a50 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -209,6 +209,10 @@ extern List *build_join_pathkeys(PlannerInfo *root,
extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
List *sortclauses,
List *tlist);
+extern List *make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist,
+ bool checkRedundant);
extern void initialize_mergeclause_eclasses(PlannerInfo *root,
RestrictInfo *restrictinfo);
extern void update_mergeclause_eclasses(PlannerInfo *root,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index c6d575a2f9..4f5c82f49d 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..f0e92a99dd 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,514 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+SELECT DISTINCT four FROM tenk1 ORDER BY four DESC;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, hundred, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) hundred
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+ a | b
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+FETCH FROM c;
+ a | b
+---+---
+ 1 | 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+END;
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+FETCH FROM c;
+ a | b
+---+-------
+ 5 | 10000
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+END;
+DROP TABLE distinct_a;
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+ QUERY PLAN
+--------------------------------------------------------------
+ Index Only Scan using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan mode: true
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 1 | 2
+ 3 | 1 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 1 | 2
+ 1 | 1 | 2
+(2 rows)
+
+END;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Index Only Scan Backward using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan mode: true
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 2 | 2
+ 1 | 2 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 2 | 2
+ 3 | 2 | 2
+(2 rows)
+
+END;
+DROP TABLE distinct_abc;
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ four | ten
+------+-----
+ 0 | 0
+ 1 | 9
+ 2 | 0
+ 3 | 1
+(4 rows)
+
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ four | ten
+------+-----
+ 1 | 9
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ QUERY PLAN
+--------------------------------------
+ Index Scan using tenk1_four on tenk1
+ Skip scan mode: true
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ QUERY PLAN
+---------------------------------------------------
+ Result
+ -> Unique
+ -> Bitmap Heap Scan on tenk1
+ Recheck Cond: (four = 1)
+ -> Bitmap Index Scan on tenk1_four
+ Index Cond: (four = 1)
+(6 rows)
+
+-- check columns order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ four | ten
+------+-----
+ 0 | 0
+ 0 | 2
+ 0 | 4
+ 0 | 6
+ 0 | 8
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Skip scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Skip scan mode: true
+ Index Cond: (four = 0)
+(3 rows)
+
+DROP INDEX tenk1_four_ten;
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ four | ten
+------+-----
+ 0 | 2
+ 2 | 2
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_ten_four on tenk1
+ Skip scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_ten_four on tenk1
+ Skip scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+DROP INDEX tenk1_ten_four;
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+ four | four
+------+------
+ 0 | 0
+ 2 | 2
+(2 rows)
+
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+ four | ?column?
+------+----------
+ 2 | 1
+ 0 | 1
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+-------------------------------------------
+ Index Only Scan using tenk1_four on tenk1
+ Skip scan mode: true
+(2 rows)
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+FETCH FROM c;
+ four
+------
+ 0
+(1 row)
+
+FETCH BACKWARD FROM c;
+ four
+------
+(0 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+END;
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 9999
+ 1 | 10000
+(5 rows)
+
+DROP TABLE distinct_visibility;
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+ SELECT a, b::int2 b, (b % 2)::int2 c FROM
+ generate_series(1, 5) a,
+ generate_series(1,366) b;
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Index Only Scan using distinct_boundaries_a_b_c_idx on distinct_boundaries
+ Skip scan mode: true
+ Index Cond: ((b >= 1) AND (c = 0))
+(3 rows)
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ a | b | c
+---+---+---
+ 1 | 2 | 0
+ 2 | 2 | 0
+ 3 | 2 | 0
+ 4 | 2 | 0
+ 5 | 2 | 0
+(5 rows)
+
+DROP TABLE distinct_boundaries;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index f96bebf410..a3be42a725 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..fddd0256ff 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,199 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+SELECT DISTINCT four FROM tenk1 ORDER BY four DESC;
+
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, hundred, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) hundred
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+DROP TABLE distinct_a;
+
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+DROP TABLE distinct_abc;
+
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+-- check columns order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+DROP INDEX tenk1_four_ten;
+
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+DROP INDEX tenk1_ten_four;
+
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+END;
+
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DROP TABLE distinct_visibility;
+
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+ SELECT a, b::int2 b, (b % 2)::int2 c FROM
+ generate_series(1, 5) a,
+ generate_series(1,366) b;
+
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+DROP TABLE distinct_boundaries;
--
2.21.0
On Wed, Sep 25, 2019 at 2:33 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
v27-0001-Index-skip-scan.patch
Some random thoughts on this:
* Is _bt_scankey_within_page() really doing the right thing within empty pages?
It looks like you're accidentally using the high key when the leaf
page is empty with forward scans (assuming that the leaf page isn't
rightmost). You'll need to think about empty pages for both forward
and backward direction scans there.
Actually, using the high key in some cases may be desirable, once the
details are worked out -- the high key is actually very helpful with
low cardinality indexes. If you populate an index with retail
insertions (i.e. don't just do a CREATE INDEX after the table is
populated), and use low cardinality data in the indexed columns then
you'll see this effect. You can have a few hundred index entries for
each distinct value, and the page split logic added to Postgres 12 (by
commit fab25024) will still manage to "trap" each set of duplicates on
their own distinct leaf page. Leaf pages will have a high key that
looks like the values that appear on the page to the right. The goal
is for forward index scans to access the minimum number of leaf pages,
especially with low cardinality data and with multi-column indexes.
(See also: commit 29b64d1d)
A good way to see this for yourself is to get the Wisconsin Benchmark
tables (the tenk1 table and related tables from the regression tests)
populated using retail insertions. "CREATE TABLE __tenk1(like tenk1
including indexes); INSERT INTO __tenk1 SELECT * FROM tenk1;" is how I
like to set this up. Then you can see that we only access one leaf
page easily by forcing bitmap scans (i.e. "set enable* ..."), and
using "EXPLAIN (analyze, buffers) SELECT ... FROM __tenk1 WHERE ...",
where the SELECT query is a simple point lookup query (bitmap scans
happen to instrument the index buffer accesses in a way that makes it
absolutely clear how many index page buffers were touched). IIRC the
existing tenk1 indexes have no more than a few hundred duplicates for
each distinct value in all cases, so only one leaf page needs to be
accessed by simple "key = val" queries in all cases.
(I imagine that the "four" index you add in the regression test would
generally need to visit more than one leaf page for simple point
lookup queries, but in any case the high key is a useful way of
detecting a "break" in the values when indexing low cardinality data
-- these breaks are generally "aligned" to leaf page boundaries.)
I also like to visualize the keyspace of indexes when poking around at
that stuff, generally by using some of the queries that you can find
on the Wiki [1].
* The extra scankeys that you are using in most of the new nbtsearch.c
code is an insertion scankey -- not a search style scankey. I think
that you should try to be a bit clearer on that distinction in
comments. It is already confusing now, but at least there is only zero
or one insertion scankeys per scan (for the initial positioning).
* There are two _bt_skip() prototypes in nbtree.h (actually, there is
a btskip() and a _bt_skip()). I understand that the former is a public
wrapper of the latter, but I find the naming a little bit confusing.
Maybe rename _bt_skip() to something that is a little bit more
suggestive of its purpose.
* Suggest running pgindent on the patch.
[1]: https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index
--
Peter Geoghegan
On Sat, Nov 2, 2019 at 11:56 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Sep 25, 2019 at 2:33 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
v27-0001-Index-skip-scan.patch
Some random thoughts on this:
And now some more:
* I'm confused about this code in _bt_skip():
/*
* Advance backward but read forward. At this moment we are at the next
* distinct key at the beginning of the series. If the scan has just
* started, we can read forward without doing anything else. Otherwise find
* the previous distinct key and the beginning of its series and read forward
* from there. To do so, go back one step, perform binary search to find
* the first item in the series and let _bt_readpage do everything else.
*/
else if (ScanDirectionIsBackward(dir) && ScanDirectionIsForward(indexdir))
{
if (!scanstart)
{
_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
/* One step back to find a previous value */
_bt_readpage(scan, dir, offnum);
Why is it okay to call _bt_drop_lock_and_maybe_pin() like this? It
looks like that will drop the lock (but not the pin) on the same
buffer that you binary search with _bt_binsrch() (since the local
variable "buf" is also the buf in "so->currPos").
* It also seems a bit odd that you assume that the scan is
"scan->xs_want_itup", but then check that condition many times. Why
bother?
* Similarly, why bother using _bt_drop_lock_and_maybe_pin() at all,
rather than just unlocking the buffer directly? We'll only drop the
pin for a scan that is "!scan->xs_want_itup", which is never the case
within _bt_skip().
I think that the macros and stuff that manage pins and buffer locks in
nbtsearch.c is kind of a disaster anyway [1]. Maybe there is some
value in trying to be consistent with existing nbtsearch.c code in
ways that aren't strictly necessary.
* Not sure why you need this code after throwing an error:
else
{
elog(ERROR, "Could not read closest index tuples: %d", offnum);
pfree(so->skipScanKey);
so->skipScanKey = NULL;
return false;
}
[1]: /messages/by-id/CAH2-Wz=m674-RKQdCG+jCD9QGzN1Kcg-FOdYw4-j+5_PfcHbpQ@mail.gmail.com
--
Peter Geoghegan
On Wed, Sep 25, 2019 at 2:33 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
v27-0001-Index-skip-scan.patch
Some random thoughts on this:
Thanks a lot for the comments!
* Is _bt_scankey_within_page() really doing the right thing within empty pages?
It looks like you're accidentally using the high key when the leaf
page is empty with forward scans (assuming that the leaf page isn't
rightmost). You'll need to think about empty pages for both forward
and backward direction scans there.
Yes, you're right, that's an issue I need to fix.
Actually, using the high key in some cases may be desirable, once the
details are worked out -- the high key is actually very helpful with
low cardinality indexes. If you populate an index with retail
insertions (i.e. don't just do a CREATE INDEX after the table is
populated), and use low cardinality data in the indexed columns then
you'll see this effect.
Can you please elaborate a bit more? I see that using the high key will help
forward index scans access the minimum number of leaf pages, but
I'm not following how it is connected to _bt_scankey_within_page? Or
is this commentary related in general to the whole implementation?
* The extra scankeys that you are using in most of the new nbtsearch.c
code is an insertion scankey -- not a search style scankey. I think
that you should try to be a bit clearer on that distinction in
comments. It is already confusing now, but at least there is only zero
or one insertion scankeys per scan (for the initial positioning).
* There are two _bt_skip() prototypes in nbtree.h (actually, there is
a btskip() and a _bt_skip()). I understand that the former is a public
wrapper of the latter, but I find the naming a little bit confusing.
Maybe rename _bt_skip() to something that is a little bit more
suggestive of its purpose.
* Suggest running pgindent on the patch.
Sure, I'll incorporate mentioned improvements into the next patch
version (hopefully soon).
And now some more:
* I'm confused about this code in _bt_skip():
Yeah, it shouldn't be there, but rather before _bt_next, which expects an
unlocked buffer. Will fix.
* It also seems a bit odd that you assume that the scan is
"scan->xs_want_itup", but then check that condition many times. Why
bother?
* Similarly, why bother using _bt_drop_lock_and_maybe_pin() at all,
rather than just unlocking the buffer directly? We'll only drop the
pin for a scan that is "!scan->xs_want_itup", which is never the case
within _bt_skip().
I think that the macros and stuff that manage pins and buffer locks in
nbtsearch.c is kind of a disaster anyway [1]. Maybe there is some
value in trying to be consistent with existing nbtsearch.c code in
ways that aren't strictly necessary.
Yep, I've seen that thread, but I tried to be consistent with the
surrounding core style. Probably it indeed doesn't make sense.
* Not sure why you need this code after throwing an error:
else
{
elog(ERROR, "Could not read closest index tuples: %d", offnum);
pfree(so->skipScanKey);
so->skipScanKey = NULL;
return false;
}
Unfortunately this is just a leftover from a previous version. Sorry for
that, will get rid of it.
On Sun, Nov 03, 2019 at 05:45:47PM +0100, Dmitry Dolgov wrote:
* The extra scankeys that you are using in most of the new nbtsearch.c
code is an insertion scankey -- not a search style scankey. I think
that you should try to be a bit clearer on that distinction in
comments. It is already confusing now, but at least there is only zero
or one insertion scankeys per scan (for the initial positioning).
* There are two _bt_skip() prototypes in nbtree.h (actually, there is
a btskip() and a _bt_skip()). I understand that the former is a public
wrapper of the latter, but I find the naming a little bit confusing.
Maybe rename _bt_skip() to something that is a little bit more
suggestive of its purpose.
* Suggest running pgindent on the patch.
Sure, I'll incorporate mentioned improvements into the next patch
version (hopefully soon).
Here is the new version, that addresses mentioned issues.
* Is _bt_scankey_within_page() really doing the right thing within empty pages?
It looks like you're accidentally using the high key when the leaf
page is empty with forward scans (assuming that the leaf page isn't
rightmost). You'll need to think about empty pages for both forward
and backward direction scans there.
Yes, you're right, that's an issue I need to fix.
If I haven't misunderstood something, for the purpose of this function it
makes sense to return false in the case of an empty page. That's what I've
added to the patch.
Actually, using the high key in some cases may be desirable, once the
details are worked out -- the high key is actually very helpful with
low cardinality indexes. If you populate an index with retail
insertions (i.e. don't just do a CREATE INDEX after the table is
populated), and use low cardinality data in the indexed columns then
you'll see this effect.
Can you please elaborate a bit more? I see that using the high key will help
forward index scans access the minimum number of leaf pages, but
I'm not following how it is connected to _bt_scankey_within_page? Or
is this commentary related in general to the whole implementation?
This question is still open.
Attachments:
v28-0001-Index-skip-scan.patch (text/x-diff; charset=us-ascii)
From f0e287da04bc314dfba48f3bfb0c8bb224938ce1 Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Wed, 3 Jul 2019 16:25:20 +0200
Subject: [PATCH v28] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan and IndexScan. To make it suitable both for
situations with a small number of distinct values and with a significant
number of distinct values, the following approach is taken: instead of
searching from the root for every value we're searching for, we first
look on the current page, and if the value is not found there, continue
searching from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
Author: Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Floris Van Nee
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 15 +
doc/src/sgml/indexam.sgml | 63 +++
doc/src/sgml/indices.sgml | 24 +
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 18 +
src/backend/access/nbtree/nbtree.c | 13 +
src/backend/access/nbtree/nbtsearch.c | 359 ++++++++++++
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 29 +
src/backend/executor/nodeIndexonlyscan.c | 51 +-
src/backend/executor/nodeIndexscan.c | 51 +-
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 3 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/path/pathkeys.c | 84 ++-
src/backend/optimizer/plan/createplan.c | 20 +-
src/backend/optimizer/plan/planagg.c | 1 +
src/backend/optimizer/plan/planner.c | 91 +++-
src/backend/optimizer/util/pathnode.c | 40 ++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 8 +
src/include/access/genam.h | 2 +
src/include/access/nbtree.h | 7 +
src/include/access/sdir.h | 7 +
src/include/nodes/execnodes.h | 6 +
src/include/nodes/pathnodes.h | 10 +
src/include/nodes/plannodes.h | 4 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/include/optimizer/paths.h | 4 +
src/test/regress/expected/create_index.out | 1 +
src/test/regress/expected/select_distinct.out | 511 ++++++++++++++++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/create_index.sql | 2 +
src/test/regress/sql/select_distinct.sql | 196 +++++++
42 files changed, 1626 insertions(+), 25 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index cc1670934f..ab9f0a7177 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 89284dc5c0..3edd12dd27 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4413,6 +4413,21 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). The default is
+ <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index dd54c68802..73b1b4fcf7 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -144,6 +144,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -687,6 +688,68 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan,
+ ScanDirection direction,
+ ScanDirection indexdir,
+ bool scanstart,
+ int prefix);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan. The arguments are:
+
+ <variablelist>
+ <varlistentry>
+ <term><parameter>scan</parameter></term>
+ <listitem>
+ <para>
+ Index scan information
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>direction</parameter></term>
+ <listitem>
+ <para>
+ The direction in which data is advancing.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>indexdir</parameter></term>
+ <listitem>
+ <para>
+ The direction in which the index must be read.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>scanstart</parameter></term>
+ <listitem>
+ <para>
+ Whether or not this is the start of the scan.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>prefix</parameter></term>
+ <listitem>
+ <para>
+ Distinct prefix size.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 95c0a1926c..567141046f 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1235,6 +1235,30 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+  <para>
+   When an index scan is used to retrieve the distinct values of a column,
+   it can be inefficient, since it has to scan all the equal values of a
+   key. In such cases the planner will consider applying the index skip
+   scan approach, which is based on the idea of a
+   <firstterm>Loose index scan</firstterm>. Rather than scanning all equal
+   values of a key, as soon as a new value is found, it searches for a
+   larger value on the same index page, and if none is found, restarts the
+   search from the root. This is much faster when the index has many
+   equal keys.
+  </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ae7b729edd..233ea9e5ec 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -109,6 +109,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cf9699ad18..9817f34c34 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0cc87911d6..38072ad24b 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 5cc30dac42..019e330cff 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -82,6 +82,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 28edd4aca7..ae7a882571 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -730,6 +731,23 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool scanstart, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction,
+ indexdir, scanstart, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..46471598d1 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -131,6 +131,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -380,6 +381,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -447,6 +450,16 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool start, int prefix)
+{
+ return _bt_skip(scan, direction, indexdir, start, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7f77ed24c5..d57de9dfa1 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -37,6 +37,10 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+static inline void _bt_update_skip_scankeys(IndexScanDesc scan,
+ Relation indexRel);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir);
/*
@@ -1373,6 +1377,305 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple.
+ *
+ * The current position is set so that a subsequent call to _bt_next will
+ * fetch the first tuple that differs in the leading 'prefix' keys.
+ *
+ * There are four different kinds of skipping (depending on dir and
+ * indexdir) that are important to distinguish, especially in the presence
+ * of an index condition:
+ *
+ * * Advancing forward and reading forward
+ * simple scan
+ *
+ * * Advancing forward and reading backward
+ * scan inside a cursor fetching backward, when skipping is necessary
+ * right from the start
+ *
+ * * Advancing backward and reading forward
+ * scan with order by desc inside a cursor fetching forward, when
+ * skipping is necessary right from the start
+ *
+ * * Advancing backward and reading backward
+ * simple scan with order by desc
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool scanstart, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Relation indexRel = scan->indexRelation;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements in
+ * order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ so->skipScanKey->keysz = prefix;
+ so->skipScanKey->scantid = NULL;
+ }
+ else
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos) &&
+ _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {
+ bool keyFound = false;
+
+ LockBuffer(so->currPos.buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, so->currPos.buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(so->currPos.buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found scan key within the current page, so let's scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+ * number
+ */
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ /*
+ * Simplest case is when both directions are forward, when we are already
+ * at the next distinct key at the beginning of the series (so everything
+ * else would be done in _bt_readpage)
+ *
+ * The case when both directions are backwards is also simple, but we need
+ * to go one step back, since we need a last element from the previous
+ * series.
+ */
+ if (ScanDirectionIsBackward(dir) && ScanDirectionIsBackward(indexdir))
+ offnum = OffsetNumberPrev(offnum);
+
+ /*
+ * Advance backward but read forward. At this moment we are at the next
+ * distinct key at the beginning of the series. If the scan has just
+ * started, we can read forward without doing anything else. Otherwise
+ * find the previous distinct key and the beginning of its series and
+ * read forward from there. To do so, go back one step, perform a binary
+ * search to find the first item in the series and let _bt_readpage do
+ * everything else.
+ */
+ else if (ScanDirectionIsBackward(dir) && ScanDirectionIsForward(indexdir))
+ {
+ if (!scanstart)
+ {
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* One step back to find a previous value */
+ _bt_readpage(scan, dir, offnum);
+
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (_bt_next(scan, dir))
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ /*
+ * And now find the last item from the sequence for the current
+ * value, with the intention to do OffsetNumberNext. As a
+ * result we end up on the first element of the sequence.
+ */
+ if (_bt_scankey_within_page(scan, so->skipScanKey,
+ so->currPos.buf, dir))
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ else
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ }
+ else
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ }
+
+ /*
+ * Advance forward but read backward. At this point we are at the next
+ * distinct key at the beginning of its series. If the scan has just
+ * started, we can go one step back and read forward without doing
+ * anything else. Otherwise, find the next distinct key and the
+ * beginning of its series, go one step back and read backward from
+ * there.
+ *
+ * An interesting situation can happen if one of the distinct keys does
+ * not pass a corresponding index condition at all. In this case reading
+ * backward can lead to the previous distinct key being found, creating
+ * a loop. To avoid that, check the value to be returned, and jump one
+ * more time if it is the same as at the beginning.
+ */
+ else if (ScanDirectionIsForward(dir) && ScanDirectionIsBackward(indexdir))
+ {
+ if (scanstart)
+ offnum = OffsetNumberPrev(offnum);
+ else
+ {
+ OffsetNumber nextOffset,
+ startOffset;
+
+ nextOffset = startOffset = ItemPointerGetOffsetNumber(&scan->xs_itup->t_tid);
+
+ while (nextOffset == startOffset)
+ {
+ /*
+ * Find the next index tuple to update the scan key. It could
+ * be at the end, so check against the max offset
+ */
+ OffsetNumber curOffnum = offnum;
+ Page page = BufferGetPage(so->currPos.buf);
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ ItemId itemid = PageGetItemId(page, Min(offnum, maxoff));
+
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ scan->xs_itup = (IndexTuple) PageGetItem(page, itemid);
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /*
+ * The jump to the next key returned the same offset, which
+ * means we are at the end and need to return
+ */
+ if (offnum == curOffnum)
+ {
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+
+ BTScanPosUnpinIfPinned(so->currPos);
+ BTScanPosInvalidate(so->currPos);
+
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+
+ /* Check if _bt_readpage returns an already-found item */
+ if (_bt_readpage(scan, indexdir, offnum))
+ {
+ IndexTuple itup;
+
+ currItem = &so->currPos.items[so->currPos.lastItem];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
+ }
+ else
+ elog(ERROR, "could not read closest index tuples: %d", offnum);
+
+ /*
+ * If nextOffset is the same as before, it means we are in
+ * a loop; return offnum to its original position and jump
+ * one step further
+ */
+ if (nextOffset == startOffset)
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, indexdir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
@@ -2244,3 +2547,59 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
so->numKilled = 0; /* just paranoia */
so->markItemIndex = -1; /* ditto */
}
+
+/*
+ * _bt_update_skip_scankeys() -- set up new values for the existing scankeys
+ * based on the current index tuple
+ */
+static inline void
+_bt_update_skip_scankeys(IndexScanDesc scan, Relation indexRel)
+{
+ TupleDesc itupdesc;
+ int indnkeyatts,
+ i;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey scankeys = so->skipScanKey->scankeys;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+}
+
+/*
+ * _bt_scankey_within_page() -- check if the provided scankey could be found
+ * within the page specified by the buffer.
+ */
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir)
+{
+ OffsetNumber low,
+ high,
+ compare_offset;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+
+ if (unlikely(high < low))
+ return false;
+
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ return _bt_compare(scan->indexRelation,
+ key, page, compare_offset) > compare_value;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 45472db147..dc151ecf09 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 62fb3434a3..ad500de12b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -130,6 +130,7 @@ static void ExplainDummyGroup(const char *objtype, const char *labelname,
static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
static void ExplainJSONLineEnding(ExplainState *es);
static void ExplainYAMLLineStarting(ExplainState *es);
+static void ExplainIndexSkipScanKeys(ExplainState *es, int skipPrefixSize);
static void escape_yaml(StringInfo buf, const char *str);
@@ -1041,6 +1042,22 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
return planstate_tree_walker(planstate, ExplainPreScanNode, rels_used);
}
+/*
+ * ExplainIndexSkipScanKeys -
+ * Append information about index skip scan to es->str.
+ *
+ * Can be used to print the skip prefix size.
+ */
+static void
+ExplainIndexSkipScanKeys(ExplainState *es, int skipPrefixSize)
+{
+ if (skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL, skipPrefixSize, es);
+ }
+}
+
/*
* ExplainNode -
* Appends a description of a plan tree to es->str
@@ -1363,6 +1380,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexScan *indexscan = (IndexScan *) plan;
+ ExplainIndexSkipScanKeys(es, indexscan->indexskipprefixsize);
+
ExplainIndexScanDetails(indexscan->indexid,
indexscan->indexorderdir,
es);
@@ -1373,6 +1392,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ ExplainIndexSkipScanKeys(es, indexonlyscan->indexskipprefixsize);
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1582,6 +1603,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
switch (nodeTag(plan))
{
case T_IndexScan:
+ if (((IndexScan *) plan)->indexskipprefixsize > 0)
+ {
+ ExplainPropertyBool("Skip scan mode", true, es);
+ }
show_scan_qual(((IndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
if (((IndexScan *) plan)->indexqualorig)
@@ -1595,6 +1620,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->indexskipprefixsize > 0)
+ {
+ ExplainPropertyBool("Skip scan mode", true, es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 652a9afc75..2aae3daae4 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -65,6 +65,13 @@ IndexOnlyNext(IndexOnlyScanState *node)
IndexScanDesc scandesc;
TupleTableSlot *slot;
ItemPointer tid;
+ IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) node->ss.ps.plan;
+
+ /*
+ * Tells whether the current position was reached via skipping. In this
+ * case there is no need to call index_getnext_tid.
+ */
+ bool skipped = false;
/*
* extract necessary information from index scan node
@@ -72,7 +79,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexonlyscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -115,14 +122,50 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ *
+ * When fetching a cursor in the direction opposite to a general scan
+ * direction, the result must be what normal fetching should have
+ * returned, but in reversed order. In other words, return the last or
+ * first scanned tuple in a DISTINCT set, depending on a cursor direction.
+ * Due to that we skip also when the first tuple wasn't emitted yet, but
+ * the directions are opposite.
+ */
+ if (node->ioss_SkipPrefixSize > 0 &&
+ (node->ioss_FirstTupleEmitted ||
+ ScanDirectionsAreOpposite(direction, indexonlyscan->indexorderdir)))
+ {
+ if (!index_skip(scandesc, direction, indexonlyscan->indexorderdir,
+ !node->ioss_FirstTupleEmitted, node->ioss_SkipPrefixSize))
+ {
+ /*
+ * Reached end of index. At this point currPos is invalidated, and
+ * we need to reset ioss_FirstTupleEmitted, since otherwise after
+ * going backwards, reaching the end of index, and going forward
+ * again we apply skip again. It would be incorrect and lead to an
+ * extra skipped item.
+ */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ else
+ {
+ skipped = true;
+ tid = &scandesc->xs_heaptid;
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
- while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+ while (skipped || (tid = index_getnext_tid(scandesc, direction)) != NULL)
{
bool tuple_from_heap = false;
CHECK_FOR_INTERRUPTS();
+ skipped = false;
/*
* We can skip the heap fetch if the TID references a heap page on
@@ -250,6 +293,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -500,6 +545,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index ac7aa81f67..3a7f5e6b8b 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -85,6 +85,13 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ IndexScan *indexscan = (IndexScan *) node->ss.ps.plan;
+
+ /*
+ * Tells whether the current position was reached via skipping. In this
+ * case there is no need to call index_getnext_tid.
+ */
+ bool skipped = false;
/*
* extract necessary information from index scan node
@@ -92,7 +99,7 @@ IndexNext(IndexScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -116,6 +123,7 @@ IndexNext(IndexScanState *node)
node->iss_NumOrderByKeys);
node->iss_ScanDesc = scandesc;
+ node->iss_ScanDesc->xs_want_itup = true;
/*
* If no run-time keys to calculate or they are ready, go ahead and
@@ -127,12 +135,48 @@ IndexNext(IndexScanState *node)
node->iss_OrderByKeys, node->iss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ *
+ * When fetching a cursor in the direction opposite to a general scan
+ * direction, the result must be what normal fetching should have
+ * returned, but in reversed order. In other words, return the last or
+ * first scanned tuple in a DISTINCT set, depending on a cursor direction.
+ * Due to that we skip also when the first tuple wasn't emitted yet, but
+ * the directions are opposite.
+ */
+ if (node->iss_SkipPrefixSize > 0 &&
+ (node->iss_FirstTupleEmitted ||
+ ScanDirectionsAreOpposite(direction, indexscan->indexorderdir)))
+ {
+ if (!index_skip(scandesc, direction, indexscan->indexorderdir,
+ !node->iss_FirstTupleEmitted, node->iss_SkipPrefixSize))
+ {
+ /*
+ * Reached end of index. At this point currPos is invalidated, and
+ * we need to reset iss_FirstTupleEmitted, since otherwise after
+ * going backwards, reaching the end of index, and going forward
+ * again we apply skip again. It would be incorrect and lead to an
+ * extra skipped item.
+ */
+ node->iss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ else
+ {
+ skipped = true;
+ index_fetch_heap(scandesc, slot);
+ }
+ }
+
/*
* ok, now that we have what we need, fetch the next tuple.
*/
- while (index_getnext_slot(scandesc, direction, slot))
+ while (skipped || index_getnext_slot(scandesc, direction, slot))
{
CHECK_FOR_INTERRUPTS();
+ skipped = false;
/*
* If the index was lossy, we have to recheck the index quals using
@@ -149,6 +193,7 @@ IndexNext(IndexScanState *node)
}
}
+ node->iss_FirstTupleEmitted = true;
return slot;
}
@@ -906,6 +951,8 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexScan;
+ indexstate->iss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->iss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index a2617c7cfd..20495c9e52 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -490,6 +490,7 @@ _copyIndexScan(const IndexScan *from)
COPY_NODE_FIELD(indexorderbyorig);
COPY_NODE_FIELD(indexorderbyops);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
@@ -515,6 +516,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e6ce8e2110..2ff9625533 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -559,6 +559,7 @@ _outIndexScan(StringInfo str, const IndexScan *node)
WRITE_NODE_FIELD(indexorderbyorig);
WRITE_NODE_FIELD(indexorderbyops);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
@@ -573,6 +574,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
@@ -2213,6 +2215,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
WRITE_NODE_FIELD(group_pathkeys);
WRITE_NODE_FIELD(window_pathkeys);
WRITE_NODE_FIELD(distinct_pathkeys);
+ WRITE_NODE_FIELD(uniq_distinct_pathkeys);
WRITE_NODE_FIELD(sort_pathkeys);
WRITE_NODE_FIELD(processed_tlist);
WRITE_NODE_FIELD(minmax_aggs);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 764e3bb90c..0fc3c5ea68 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1787,6 +1787,7 @@ _readIndexScan(void)
READ_NODE_FIELD(indexorderbyorig);
READ_NODE_FIELD(indexorderbyops);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
@@ -1806,6 +1807,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c5f6593485..194e258dc1 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 2f4fea241a..70c1df47a4 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
#include "utils/lsyscache.h"
+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
RelOptInfo *partrel,
@@ -96,6 +97,30 @@ make_canonical_pathkey(PlannerInfo *root,
return pk;
}
+/*
+ * pathkey_is_unique
+ * Part of pathkey_is_redundant that is responsible for the case when
+ * the new pathkey's equivalence class is the same as that of an
+ * existing member of the pathkey list.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+ EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+ ListCell *lc;
+
+ /* If same EC already used in list, then redundant */
+ foreach(lc, pathkeys)
+ {
+ PathKey *old_pathkey = (PathKey *) lfirst(lc);
+
+ if (new_ec == old_pathkey->pk_eclass)
+ return true;
+ }
+
+ return false;
+}
+
/*
* pathkey_is_redundant
* Is a pathkey redundant with one already in the given list?
@@ -135,22 +160,12 @@ static bool
pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys)
{
EquivalenceClass *new_ec = new_pathkey->pk_eclass;
- ListCell *lc;
/* Check for EC containing a constant --- unconditionally redundant */
if (EC_MUST_BE_REDUNDANT(new_ec))
return true;
- /* If same EC already used in list, then redundant */
- foreach(lc, pathkeys)
- {
- PathKey *old_pathkey = (PathKey *) lfirst(lc);
-
- if (new_ec == old_pathkey->pk_eclass)
- return true;
- }
-
- return false;
+ return pathkey_is_unique(new_pathkey, pathkeys);
}
/*
@@ -1098,6 +1113,53 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
return pathkeys;
}
+/*
+ * make_pathkeys_for_distinctclauses
+ * Generate a pathkeys list for distinct clauses that represents the
+ * sort order specified by a list of SortGroupClauses. Similar to
+ * make_pathkeys_for_sortclauses, but allows the caller to specify
+ * whether to check for full redundancy or just uniqueness.
+ */
+List *
+make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *distinctclauses,
+ List *tlist, bool checkRedundant)
+{
+ List *pathkeys = NIL;
+ ListCell *l;
+
+ foreach(l, distinctclauses)
+ {
+ SortGroupClause *sortcl = (SortGroupClause *) lfirst(l);
+ Expr *sortkey;
+ PathKey *pathkey;
+
+ sortkey = (Expr *) get_sortgroupclause_expr(sortcl, tlist);
+ Assert(OidIsValid(sortcl->sortop));
+ pathkey = make_pathkey_from_sortop(root,
+ sortkey,
+ root->nullable_baserels,
+ sortcl->sortop,
+ sortcl->nulls_first,
+ sortcl->tleSortGroupRef,
+ true);
+
+ /* Canonical form eliminates redundant ordering keys */
+ if (checkRedundant)
+ {
+ if (!pathkey_is_redundant(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ else
+ {
+ if (!pathkey_is_unique(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+ }
+ return pathkeys;
+}
+
+
/****************************************************************************
* PATHKEYS AND MERGECLAUSES
****************************************************************************/
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 0c036209f0..6e54446b29 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -175,12 +175,14 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
Oid indexid, List *indexqual, List *indexqualorig,
List *indexorderby, List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2905,7 +2907,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -2916,7 +2919,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexorderbys,
indexorderbys,
indexorderbyops,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
copy_generic_path_info(&scan_plan->plan, &best_path->path);
@@ -5179,7 +5183,8 @@ make_indexscan(List *qptlist,
List *indexorderby,
List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexScan *node = makeNode(IndexScan);
Plan *plan = &node->scan.plan;
@@ -5196,6 +5201,7 @@ make_indexscan(List *qptlist,
node->indexorderbyorig = indexorderbyorig;
node->indexorderbyops = indexorderbyops;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
@@ -5208,7 +5214,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5223,6 +5230,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 9381939c82..ed52139839 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -505,6 +505,7 @@ minmax_qp_callback(PlannerInfo *root, void *extra)
root->group_pathkeys = NIL;
root->window_pathkeys = NIL;
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 17c5f086fb..644b8ad356 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3622,12 +3622,21 @@ standard_qp_callback(PlannerInfo *root, void *extra)
if (parse->distinctClause &&
grouping_is_sortable(parse->distinctClause))
+ {
+ root->uniq_distinct_pathkeys =
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, false);
root->distinct_pathkeys =
- make_pathkeys_for_sortclauses(root,
- parse->distinctClause,
- tlist);
+ make_pathkeys_for_distinctclauses(root,
+ parse->distinctClause,
+ tlist, true);
+ }
else
+ {
root->distinct_pathkeys = NIL;
+ root->uniq_distinct_pathkeys = NIL;
+ }
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
@@ -4814,6 +4823,82 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Consider index skip scan as well */
+ if (enable_indexskipscan &&
+ IsA(path, IndexPath) &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys != NIL)
+ {
+ ListCell *lc;
+ IndexOptInfo *index = NULL;
+ bool different_columns_order = false,
+ not_empty_qual = false;
+ int i = 0;
+ int distinctPrefixKeys;
+
+ Assert(path->pathtype == T_IndexOnlyScan ||
+ path->pathtype == T_IndexScan);
+
+ index = ((IndexPath *) path)->indexinfo;
+ distinctPrefixKeys = list_length(root->uniq_distinct_pathkeys);
+
+ /*
+ * Normally we can think of distinctPrefixKeys as just the
+ * number of distinct keys. But if, say, we have a distinct
+ * key a, and the index contains (b, a) in exactly this
+ * order, we need to use the position of a in the index as
+ * distinctPrefixKeys; otherwise skipping would happen only
+ * over the first column.
+ */
+ foreach(lc, root->uniq_distinct_pathkeys)
+ {
+ PathKey *pathKey = lfirst_node(PathKey, lc);
+ EquivalenceMember *em =
+ lfirst_node(EquivalenceMember,
+ list_head(pathKey->pk_eclass->ec_members));
+ Var *var = (Var *) em->em_expr;
+
+ Assert(i < index->ncolumns);
+
+ for (i = 0; i < index->ncolumns; i++)
+ {
+ if (index->indexkeys[i] == var->varattno)
+ {
+ distinctPrefixKeys = Max(i + 1, distinctPrefixKeys);
+ break;
+ }
+ }
+ }
+
+ /*
+ * XXX: In the case of an index scan, qual evaluation happens
+ * after ExecScanFetch, which means skip results could be
+ * filtered out. Consider the following query:
+ *
+ * select distinct on (a, b) a, b, c from t where c < 100;
+ *
+ * Skip scan returns one tuple per distinct set of (a, b) with
+ * an arbitrary c, so if the chosen c does not match the qual
+ * while some other c for the same (a, b) does, we miss that
+ * tuple.
+ */
+ if (path->pathtype == T_IndexScan &&
+ parse->jointree != NULL &&
+ parse->jointree->quals != NULL &&
+ list_length((List *) parse->jointree->quals) != 0)
+ not_empty_qual = true;
+
+ if (!different_columns_order && !not_empty_qual)
+ {
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ distinctPrefixKeys,
+ numDistinctRows));
+ }
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 34acb732ee..e248584fb1 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2904,6 +2904,46 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *basepath,
+ int distinctPrefixKeys,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(distinctPrefixKeys > 0);
+ /*Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));*/
+ pathnode->indexskipprefix = distinctPrefixKeys;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = basepath->startup_cost;
+ pathnode->path.total_cost = basepath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index cf1761401d..34fbc27716 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -272,6 +272,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 90ffd89339..9e5b74b6de 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -910,6 +910,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0fc23e3a61..88f9890780 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6e3db06eed..fd1595d3be 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -130,6 +130,13 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir,
+ ScanDirection indexdir,
+ bool start,
+ int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -225,6 +232,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 8c053be2ca..e5ec5b07c8 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -173,6 +173,8 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool start, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 52eafe6b00..c1d6ff98f5 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -662,6 +662,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTScanInsert skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -776,6 +779,8 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool start, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -800,6 +805,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool start, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/access/sdir.h b/src/include/access/sdir.h
index 664e72ef5d..dff90fada1 100644
--- a/src/include/access/sdir.h
+++ b/src/include/access/sdir.h
@@ -55,4 +55,11 @@ typedef enum ScanDirection
#define ScanDirectionIsForward(direction) \
((bool) ((direction) == ForwardScanDirection))
+/*
+ * ScanDirectionsAreOpposite
+ * True iff scan directions are backward/forward or forward/backward.
+ */
+#define ScanDirectionsAreOpposite(dirA, dirB) \
+ ((bool) (dirA != NoMovementScanDirection && dirA == -dirB))
+
#endif /* SDIR_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f42189d2bf..93f9cdc33d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1377,6 +1377,8 @@ typedef struct IndexScanState
ExprContext *iss_RuntimeContext;
Relation iss_RelationDesc;
struct IndexScanDescData *iss_ScanDesc;
+ int iss_SkipPrefixSize;
+ bool iss_FirstTupleEmitted;
/* These are needed for re-checking ORDER BY expr ordering */
pairingheap *iss_ReorderQueue;
@@ -1406,6 +1408,8 @@ typedef struct IndexScanState
* TableSlot slot for holding tuples fetched from the table
* VMBuffer buffer in use for visibility map testing, if any
* PscanLen size of parallel index-only scan descriptor
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ----------------
*/
typedef struct IndexOnlyScanState
@@ -1424,6 +1428,8 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
TupleTableSlot *ioss_TableSlot;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 23a06d718e..3a21118ad1 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -298,6 +298,11 @@ struct PlannerInfo
List *group_pathkeys; /* groupClause pathkeys, if any */
List *window_pathkeys; /* pathkeys of bottom window, if any */
List *distinct_pathkeys; /* distinctClause pathkeys, if any */
+ List *uniq_distinct_pathkeys; /* unique, but potentially redundant
+ * distinctClause pathkeys, if any.
+ * Used for index skip scan, since
+ * redundant distinctClauses also must
+ * be considered */
List *sort_pathkeys; /* sortClause pathkeys, if any */
List *part_schemes; /* Canonicalised partition schemes used in the
@@ -833,6 +838,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1169,6 +1175,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1181,6 +1190,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 8e6594e355..f09c8c43a3 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -405,6 +405,8 @@ typedef struct IndexScan
List *indexorderbyorig; /* the same in original form */
List *indexorderbyops; /* OIDs of sort ops for ORDER BY exprs */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct
+ * scans */
} IndexScan;
/* ----------------
@@ -432,6 +434,8 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct
+ * scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b3d0b4f6fb..9abfdfb6bd 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index a12af54971..7edcf4e689 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -200,6 +200,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137d1d..a782d12a50 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -209,6 +209,10 @@ extern List *build_join_pathkeys(PlannerInfo *root,
extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
List *sortclauses,
List *tlist);
+extern List *make_pathkeys_for_distinctclauses(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist,
+ bool checkRedundant);
extern void initialize_mergeclause_eclasses(PlannerInfo *root,
RestrictInfo *restrictinfo);
extern void update_mergeclause_eclasses(PlannerInfo *root,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index c6d575a2f9..4f5c82f49d 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -19,6 +19,7 @@ CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..f0e92a99dd 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,514 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+ four
+------
+ 1
+(1 row)
+
+SELECT DISTINCT four FROM tenk1 ORDER BY four DESC;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, hundred, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) hundred
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+ a | b
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+FETCH FROM c;
+ a | b
+---+---
+ 1 | 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+END;
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+FETCH FROM c;
+ a | b
+---+-------
+ 5 | 10000
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+END;
+DROP TABLE distinct_a;
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+ QUERY PLAN
+--------------------------------------------------------------
+ Index Only Scan using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan mode: true
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 1 | 2
+ 3 | 1 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 1 | 2
+ 1 | 1 | 2
+(2 rows)
+
+END;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Index Only Scan Backward using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan mode: true
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 2 | 2
+ 1 | 2 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 2 | 2
+ 3 | 2 | 2
+(2 rows)
+
+END;
+DROP TABLE distinct_abc;
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ four | ten
+------+-----
+ 0 | 0
+ 1 | 9
+ 2 | 0
+ 3 | 1
+(4 rows)
+
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ four | ten
+------+-----
+ 1 | 9
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+ QUERY PLAN
+--------------------------------------
+ Index Scan using tenk1_four on tenk1
+ Skip scan mode: true
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+ QUERY PLAN
+---------------------------------------------------
+ Result
+ -> Unique
+ -> Bitmap Heap Scan on tenk1
+ Recheck Cond: (four = 1)
+ -> Bitmap Index Scan on tenk1_four
+ Index Cond: (four = 1)
+(6 rows)
+
+-- check colums order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ four | ten
+------+-----
+ 0 | 0
+ 0 | 2
+ 0 | 4
+ 0 | 6
+ 0 | 8
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Skip scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_four_ten on tenk1
+ Skip scan mode: true
+ Index Cond: (four = 0)
+(3 rows)
+
+DROP INDEX tenk1_four_ten;
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ four
+------
+ 0
+ 2
+(2 rows)
+
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ four | ten
+------+-----
+ 0 | 2
+ 2 | 2
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_ten_four on tenk1
+ Skip scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+ QUERY PLAN
+-----------------------------------------------
+ Index Only Scan using tenk1_ten_four on tenk1
+ Skip scan mode: true
+ Index Cond: (ten = 2)
+(3 rows)
+
+DROP INDEX tenk1_ten_four;
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+ four | four
+------+------
+ 0 | 0
+ 2 | 2
+(2 rows)
+
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+ four | ?column?
+------+----------
+ 2 | 1
+ 0 | 1
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+ QUERY PLAN
+-------------------------------------------
+ Index Only Scan using tenk1_four on tenk1
+ Skip scan mode: true
+(2 rows)
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+FETCH FROM c;
+ four
+------
+ 0
+(1 row)
+
+FETCH BACKWARD FROM c;
+ four
+------
+(0 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+FETCH 5 FROM c;
+ four
+------
+ 0
+ 1
+ 2
+ 3
+(4 rows)
+
+FETCH BACKWARD 5 FROM c;
+ four
+------
+ 3
+ 2
+ 1
+ 0
+(4 rows)
+
+END;
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 9999
+ 1 | 10000
+(5 rows)
+
+DROP TABLE distinct_visibility;
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+ SELECT a, b::int2 b, (b % 2)::int2 c FROM
+ generate_series(1, 5) a,
+ generate_series(1,366) b;
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Index Only Scan using distinct_boundaries_a_b_c_idx on distinct_boundaries
+ Skip scan mode: true
+ Index Cond: ((b >= 1) AND (c = 0))
+(3 rows)
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ a | b | c
+---+---+---
+ 1 | 2 | 0
+ 2 | 2 | 0
+ 3 | 2 | 0
+ 4 | 2 | 0
+ 5 | 2 | 0
+(5 rows)
+
+DROP TABLE distinct_boundaries;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index f96bebf410..a3be42a725 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -26,6 +26,8 @@ CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk1_four ON tenk1 (four);
+
CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..fddd0256ff 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,199 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index only skip scan
+SELECT DISTINCT four FROM tenk1;
+SELECT DISTINCT four FROM tenk1 WHERE four = 1;
+SELECT DISTINCT four FROM tenk1 ORDER BY four DESC;
+
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, hundred, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) hundred
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+DROP TABLE distinct_a;
+
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+DROP TABLE distinct_abc;
+
+-- index skip scan
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 ORDER BY four;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (four) four, ten
+FROM tenk1 WHERE four = 1 ORDER BY four;
+
+-- check colums order
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+CREATE INDEX tenk1_four_ten on tenk1 (four, ten);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+-- test uniq_distinct_pathkeys
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE four = 0;
+
+DROP INDEX tenk1_four_ten;
+
+CREATE INDEX tenk1_ten_four on tenk1 (ten, four);
+
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (four, ten) four, ten FROM tenk1 WHERE ten = 2;
+
+DROP INDEX tenk1_ten_four;
+
+-- check projection case
+SELECT DISTINCT four, four FROM tenk1 WHERE ten = 2;
+SELECT DISTINCT four, 1 FROM tenk1 WHERE ten = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT four FROM tenk1;
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT four FROM tenk1;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+FETCH 5 FROM c;
+FETCH BACKWARD 5 FROM c;
+
+END;
+
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DROP TABLE distinct_visibility;
+
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+ SELECT a, b::int2 b, (b % 2)::int2 c FROM
+ generate_series(1, 5) a,
+ generate_series(1,366) b;
+
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+DROP TABLE distinct_boundaries;
--
2.21.0
Hi,
I've looked at the patch again - in general it seems in pretty good
shape; the issues I found are mostly minor.
Firstly, I'd like to point out that not all of the things I complained
about in my 2019/06/23 review got addressed. Those were mostly related
to formatting and code style, and the attached patch fixes some (but
maybe not all) of them.
The patch also tweaks wording of some bits in the docs and comments that
I found unclear. Would be nice if a native speaker could take a look.
A couple more comments:
1) pathkey_is_unique
The one additional issue I found is pathkey_is_unique - it's not really
explained what "unique" means and how it's different from "redundant"
(which has quite a long explanation before pathkey_is_redundant).
My understanding is that a pathkey is "unique" when its EC does not match
an EC of another pathkey in the list. But if that's the case, then the
function name is wrong - it does exactly the opposite (i.e. it returns
'true' when the pathkey is *not* unique).
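To make the naming mismatch concrete, here's a simplified, self-contained sketch (not the actual planner code - the `PathKey` stand-in here is reduced to just an equivalence-class id) of what the function seems to do:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a pathkey: just its equivalence-class id. */
typedef struct PathKey { int ec_id; } PathKey;

/*
 * Despite the name, this returns true when the new pathkey's EC already
 * matches the EC of some pathkey in the list -- i.e. when it is NOT
 * unique. Either the name or the return convention should be flipped
 * (or a comment added, like the one before pathkey_is_redundant).
 */
static bool
pathkey_is_unique(const PathKey *new_pathkey,
                  const PathKey *pathkeys, size_t npathkeys)
{
    for (size_t i = 0; i < npathkeys; i++)
    {
        if (pathkeys[i].ec_id == new_pathkey->ec_id)
            return true;    /* duplicate EC found */
    }
    return false;           /* no match: the pathkey actually is unique */
}
```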
2) explain
I wonder if we should print the "Skip scan" info always, or similarly to
"Inner Unique" which does this:
/* try not to be too chatty about this in text mode */
if (es->format != EXPLAIN_FORMAT_TEXT ||
(es->verbose && ((Join *) plan)->inner_unique))
ExplainPropertyBool("Inner Unique",
((Join *) plan)->inner_unique,
es);
break;
I'd do the same thing for skip scan - print it only in verbose mode, or
when using a non-text output format.
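The decision rule that gating boils down to can be sketched as a small self-contained function (the enum and function here are illustrative, not the actual `ExplainState` machinery - the point is just the condition):

```c
#include <stdbool.h>

typedef enum
{
    EXPLAIN_FORMAT_TEXT,
    EXPLAIN_FORMAT_XML,
    EXPLAIN_FORMAT_JSON,
    EXPLAIN_FORMAT_YAML
} ExplainFormat;

/*
 * Mirror of the "Inner Unique" precedent: in text mode, only print the
 * property when it's interesting (verbose AND actually a skip scan);
 * in structured formats, always print it so the output schema stays
 * stable for tools that parse it.
 */
static bool
should_print_skip_scan(ExplainFormat format, bool verbose,
                       int skip_prefix_size)
{
    return format != EXPLAIN_FORMAT_TEXT ||
           (verbose && skip_prefix_size > 0);
}
```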
3) There's an annoying limitation that for this to kick in, the order of
expressions in the DISTINCT clause has to match the index, i.e. with
index on (a,b,c) the skip scan will only kick in for queries using
DISTINCT a
DISTINCT a,b
DISTINCT a,b,c
but not e.g. DISTINCT a,c,b. I don't think there's anything forcing us
to sort the result of DISTINCT in any particular order, except that we
don't consider the other orderings "interesting", so we don't really
consider using the index (and thus have no chance of using the skip scan).
That leads to pretty annoying speedups/slowdowns due to seemingly
irrelevant changes:
-- everything great, a,b,c matches an index
test=# explain (analyze, verbose) select distinct a,b,c from t;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using t_a_b_c_idx on public.t (cost=0.42..565.25 rows=1330 width=12) (actual time=0.016..10.387 rows=1331 loops=1)
Output: a, b, c
Skip scan: true
Heap Fetches: 1331
Planning Time: 0.106 ms
Execution Time: 10.843 ms
(6 rows)
-- slow, mismatch with index
test=# explain (analyze, verbose) select distinct a,c,b from t;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=22906.00..22919.30 rows=1330 width=12) (actual time=802.067..802.612 rows=1331 loops=1)
Output: a, c, b
Group Key: t.a, t.c, t.b
-> Seq Scan on public.t (cost=0.00..15406.00 rows=1000000 width=12) (actual time=0.010..355.361 rows=1000000 loops=1)
Output: a, b, c
Planning Time: 0.076 ms
Execution Time: 803.078 ms
(7 rows)
-- fast again, the extra ordering allows using the index again
test=# explain (analyze, verbose) select distinct a,c,b from t order by a,b,c;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using t_a_b_c_idx on public.t (cost=0.42..565.25 rows=1330 width=12) (actual time=0.035..12.120 rows=1331 loops=1)
Output: a, c, b
Skip scan: true
Heap Fetches: 1331
Planning Time: 0.053 ms
Execution Time: 12.632 ms
(6 rows)
This is a more generic issue, not specific to this patch, of course. I
think we saw it with the incremental sort patch, IIRC. I wonder how
difficult it would be to fix this here (not necessarily in v1).
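One possible direction (just a sketch of the idea, not planner code): since the DISTINCT output order is not semantically constrained, it would be enough to test that the distinct columns form a permutation of some index prefix. With column numbers as plain ints:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true if the first ndistinct entries of index_cols (the index
 * prefix) contain exactly the columns in distinct_cols, in any order.
 * Assumes distinct_cols has no duplicates (true for a DISTINCT clause),
 * so set containment of an n-element set in an n-column prefix implies
 * equality. This is the test that would let DISTINCT a,c,b use an
 * index on (a,b,c).
 */
static bool
distinct_matches_index_prefix(const int *distinct_cols, size_t ndistinct,
                              const int *index_cols, size_t nindex)
{
    if (ndistinct > nindex)
        return false;
    for (size_t i = 0; i < ndistinct; i++)
    {
        bool found = false;

        for (size_t j = 0; j < ndistinct; j++)
        {
            if (index_cols[j] == distinct_cols[i])
            {
                found = true;
                break;
            }
        }
        if (!found)
            return false;
    }
    return true;
}
```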
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
skipscan-review.patch (text/plain)
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 73b1b4fcf7..94e09835b4 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -144,7 +144,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
- amskip_function amskip; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 567141046f..efc5e41389 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1248,15 +1248,14 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
</indexterm>
<para>
- In case if index scan is used to retrieve the distinct values of a column
- efficiently, it can be not very efficient, since it requires to scan all
- the equal values of a key. In such cases planner will consider apply index
- skip scan approach, which is based on the idea of
- <firstterm>Loose index scan</firstterm>. Rather than scanning all equal
- values of a key, as soon as a new value is found, it will search for a
- larger value on the same index page, and if not found, restart the search
- by looking for a larger value. This is much faster when the index has many
- equal keys.
+ When the rows retrieved from an index scan are then deduplicated by
+ eliminating rows matching on a prefix of index keys (e.g. when using
+ <literal>SELECT DISTINCT</literal>), the planner will consider
+ skipping groups of rows with a matching key prefix. When a row with
+ a particular prefix is found, remaining rows with the same key prefix
+ are skipped. The larger the number of rows with the same key prefix
+ rows (i.e. the lower the number of distinct key prefixes in the index),
+ the more efficient this is.
</para>
</sect2>
</sect1>
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 72dcd4d734..74786bf4a2 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1403,6 +1403,8 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
*
* * Advancing backward and reading backward
* simple scan with order by desc
+ *
+ * TODO Explain why we search the current page first, then do lookup from root?
*/
bool
_bt_skip(IndexScanDesc scan, ScanDirection dir,
@@ -1556,7 +1558,7 @@ _bt_skip(IndexScanDesc scan, ScanDirection dir,
}
/*
- * Andvance forward but read backward. At this moment we are at the next
+ * Advance forward but read backward. At this moment we are at the next
* distinct key at the beginning of the series. In case if scan just
* started, we can go one step back and read forward without doing
* anything else. Otherwise find the next distinct key and the beginning
@@ -1564,7 +1566,7 @@ _bt_skip(IndexScanDesc scan, ScanDirection dir,
*
* An interesting situation can happen if one of distinct keys do not pass
* a corresponding index condition at all. In this case reading backward
- * can lead to a previous distinc key being found, creating a loop. To
+ * can lead to a previous distinct key being found, creating a loop. To
* avoid that check the value to be returned, and jump one more time if
* it's the same as at the beginning.
*/
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ad500de12b..b66296d6c9 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -130,7 +130,7 @@ static void ExplainDummyGroup(const char *objtype, const char *labelname,
static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
static void ExplainJSONLineEnding(ExplainState *es);
static void ExplainYAMLLineStarting(ExplainState *es);
-static void ExplainIndexSkipScanKeys(ExplainState *es, int skipPrefixSize);
+static void ExplainIndexSkipScanKeys(int skipPrefixSize, ExplainState *es);
static void escape_yaml(StringInfo buf, const char *str);
@@ -1049,7 +1049,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
* Can be used to print the skip prefix size.
*/
static void
-ExplainIndexSkipScanKeys(ExplainState *es, int skipPrefixSize)
+ExplainIndexSkipScanKeys(int skipPrefixSize, ExplainState *es)
{
if (skipPrefixSize > 0)
{
@@ -1380,7 +1380,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexScan *indexscan = (IndexScan *) plan;
- ExplainIndexSkipScanKeys(es, indexscan->indexskipprefixsize);
+ ExplainIndexSkipScanKeys(indexscan->indexskipprefixsize, es);
ExplainIndexScanDetails(indexscan->indexid,
indexscan->indexorderdir,
@@ -1392,7 +1392,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
- ExplainIndexSkipScanKeys(es, indexonlyscan->indexskipprefixsize);
+ ExplainIndexSkipScanKeys(indexonlyscan->indexskipprefixsize, es);
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
@@ -1604,9 +1604,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
case T_IndexScan:
if (((IndexScan *) plan)->indexskipprefixsize > 0)
- {
- ExplainPropertyBool("Skip scan mode", true, es);
- }
+ ExplainPropertyBool("Skip scan", true, es);
show_scan_qual(((IndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
if (((IndexScan *) plan)->indexqualorig)
@@ -1621,9 +1619,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
break;
case T_IndexOnlyScan:
if (((IndexOnlyScan *) plan)->indexskipprefixsize > 0)
- {
- ExplainPropertyBool("Skip scan mode", true, es);
- }
+ ExplainPropertyBool("Skip scan", true, es);
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 70c1df47a4..5683edcd2f 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -99,9 +99,12 @@ make_canonical_pathkey(PlannerInfo *root,
/*
* pathkey_is_unique
- * Part of pathkey_is_redundant, that is reponsible for the case, when the
- * new pathkey's equivalence class is the same as that of any existing
- * member of the pathkey list.
+ * Checks if the new pathkey's equivalence class is the same as that of
+ * any existing member of the pathkey list.
+ *
+ * FIXME This seems to be misnamed, i.e. it returns true/false in a way
+ * that contradicts the name. If there already is a matching EC, it says
+ * true, but that means "not unique" no?
*/
static bool
pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
@@ -165,6 +168,10 @@ pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys)
if (EC_MUST_BE_REDUNDANT(new_ec))
return true;
+ /*
+ * FIXME misnamed - this returns true when there is a matching EC, which
+ * however should be "not unique" I think.
+ */
return pathkey_is_unique(new_pathkey, pathkeys);
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 644b8ad356..3f625938e8 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3634,8 +3634,8 @@ standard_qp_callback(PlannerInfo *root, void *extra)
}
else
{
- root->distinct_pathkeys = NIL;
root->uniq_distinct_pathkeys = NIL;
+ root->distinct_pathkeys = NIL;
}
root->sort_pathkeys =
Hi Tomas,
On 11/10/19 4:18 PM, Tomas Vondra wrote:
I've looked at the patch again - in general it seems in pretty good
shape, all the issues I found are mostly minor.

Firstly, I'd like to point out that not all of the things I complained
about in my 2019/06/23 review got addressed. Those were mostly related
to formatting and code style, and the attached patch fixes some (but
maybe not all) of them.
Sorry about that!
The patch also tweaks wording of some bits in the docs and comments that
I found unclear. Would be nice if a native speaker could take a look.

A couple more comments:
1) pathkey_is_unique
The one additional issue I found is pathkey_is_unique - it's not really
explained what "unique" means and how it's different from "redundant"
(which has quite a long explanation before pathkey_is_redundant).

My understanding is that a pathkey is "unique" when its EC does not match
an EC of another pathkey in the list. But if that's the case, then the
function name is wrong - it does exactly the opposite (i.e. it returns
'true' when the pathkey is *not* unique).
Yeah, you are correct - forgot to move that part from the _uniquekey
version of the patch.
2) explain
I wonder if we should print the "Skip scan" info always, or similarly to
"Inner Unique" which does this:

    /* try not to be too chatty about this in text mode */
    if (es->format != EXPLAIN_FORMAT_TEXT ||
        (es->verbose && ((Join *) plan)->inner_unique))
        ExplainPropertyBool("Inner Unique",
                            ((Join *) plan)->inner_unique,
                            es);
    break;

I'd do the same thing for skip scan - print it only in verbose mode, or
when using non-text output format.

I think it is of benefit to see if skip scan kicks in, but used your
"Skip scan" suggestion.
3) There's an annoying limitation that for this to kick in, the order of
expressions in the DISTINCT clause has to match the index, i.e. with
index on (a,b,c) the skip scan will only kick in for queries using

    DISTINCT a
    DISTINCT a,b
    DISTINCT a,b,c

but not e.g. DISTINCT a,c,b. I don't think there's anything forcing us
to sort result of DISTINCT in any particular case, except that we don't
consider the other orderings "interesting" so we don't really consider
using the index (so no chance of using the skip scan).

That leads to pretty annoying speedups/slowdowns due to seemingly
irrelevant changes:

-- everything great, a,b,c matches an index
test=# explain (analyze, verbose) select distinct a,b,c from t;
                               QUERY PLAN
---------------------------------------------------------------------------
 Index Only Scan using t_a_b_c_idx on public.t  (cost=0.42..565.25
     rows=1330 width=12) (actual time=0.016..10.387 rows=1331 loops=1)
   Output: a, b, c
   Skip scan: true
   Heap Fetches: 1331
 Planning Time: 0.106 ms
 Execution Time: 10.843 ms
(6 rows)

-- slow, mismatch with index
test=# explain (analyze, verbose) select distinct a,c,b from t;
                               QUERY PLAN
---------------------------------------------------------------------------
 HashAggregate  (cost=22906.00..22919.30 rows=1330 width=12) (actual
     time=802.067..802.612 rows=1331 loops=1)
   Output: a, c, b
   Group Key: t.a, t.c, t.b
   ->  Seq Scan on public.t  (cost=0.00..15406.00 rows=1000000
         width=12) (actual time=0.010..355.361 rows=1000000 loops=1)
         Output: a, b, c
 Planning Time: 0.076 ms
 Execution Time: 803.078 ms
(7 rows)

-- fast again, the extra ordering allows using the index again
test=# explain (analyze, verbose) select distinct a,c,b from t order by
a,b,c;
                               QUERY PLAN
---------------------------------------------------------------------------
 Index Only Scan using t_a_b_c_idx on public.t  (cost=0.42..565.25
     rows=1330 width=12) (actual time=0.035..12.120 rows=1331 loops=1)
   Output: a, c, b
   Skip scan: true
   Heap Fetches: 1331
 Planning Time: 0.053 ms
 Execution Time: 12.632 ms
(6 rows)

This is a more generic issue, not specific to this patch, of course. I
think we saw it with the incremental sort patch, IIRC. I wonder how
difficult would it be to fix this here (not necessarily in v1).
Yeah, I see it as separate to this patch as well. But definitely
something that should be revisited.
Thanks for your patch! v29 using UniqueKey attached.
Best regards,
Jesper
Attachments:
v29_0001-Unique-key.patch (text/x-patch)
From 4e27a04702002d06f60468f8a9033d2ac2e12d8a Mon Sep 17 00:00:00 2001
From: jesperpedersen <jesper.pedersen@redhat.com>
Date: Mon, 11 Nov 2019 08:49:43 -0500
Subject: [PATCH 1/2] Unique key
Design by David Rowley.
Author: Jesper Pedersen
---
src/backend/nodes/outfuncs.c | 14 +++
src/backend/nodes/print.c | 39 +++++++
src/backend/optimizer/path/Makefile | 3 +-
src/backend/optimizer/path/allpaths.c | 8 ++
src/backend/optimizer/path/costsize.c | 5 +
src/backend/optimizer/path/indxpath.c | 41 +++++++
src/backend/optimizer/path/pathkeys.c | 71 ++++++++++--
src/backend/optimizer/path/uniquekey.c | 147 +++++++++++++++++++++++++
src/backend/optimizer/plan/planagg.c | 1 +
src/backend/optimizer/plan/planmain.c | 1 +
src/backend/optimizer/plan/planner.c | 17 ++-
src/backend/optimizer/util/pathnode.c | 12 ++
src/include/nodes/nodes.h | 1 +
src/include/nodes/pathnodes.h | 18 +++
src/include/nodes/print.h | 1 +
src/include/optimizer/pathnode.h | 1 +
src/include/optimizer/paths.h | 11 ++
17 files changed, 378 insertions(+), 13 deletions(-)
create mode 100644 src/backend/optimizer/path/uniquekey.c
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b0dcd02ff6..1ccd68d3aa 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1720,6 +1720,7 @@ _outPathInfo(StringInfo str, const Path *node)
WRITE_FLOAT_FIELD(startup_cost, "%.2f");
WRITE_FLOAT_FIELD(total_cost, "%.2f");
WRITE_NODE_FIELD(pathkeys);
+ WRITE_NODE_FIELD(uniquekeys);
}
/*
@@ -2201,6 +2202,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
WRITE_NODE_FIELD(eq_classes);
WRITE_BOOL_FIELD(ec_merging_done);
WRITE_NODE_FIELD(canon_pathkeys);
+ WRITE_NODE_FIELD(canon_uniquekeys);
WRITE_NODE_FIELD(left_join_clauses);
WRITE_NODE_FIELD(right_join_clauses);
WRITE_NODE_FIELD(full_join_clauses);
@@ -2210,6 +2212,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
WRITE_NODE_FIELD(placeholder_list);
WRITE_NODE_FIELD(fkey_list);
WRITE_NODE_FIELD(query_pathkeys);
+ WRITE_NODE_FIELD(query_uniquekeys);
WRITE_NODE_FIELD(group_pathkeys);
WRITE_NODE_FIELD(window_pathkeys);
WRITE_NODE_FIELD(distinct_pathkeys);
@@ -2397,6 +2400,14 @@ _outPathKey(StringInfo str, const PathKey *node)
WRITE_BOOL_FIELD(pk_nulls_first);
}
+static void
+_outUniqueKey(StringInfo str, const UniqueKey *node)
+{
+ WRITE_NODE_TYPE("UNIQUEKEY");
+
+ WRITE_NODE_FIELD(eq_clause);
+}
+
static void
_outPathTarget(StringInfo str, const PathTarget *node)
{
@@ -4083,6 +4094,9 @@ outNode(StringInfo str, const void *obj)
case T_PathKey:
_outPathKey(str, obj);
break;
+ case T_UniqueKey:
+ _outUniqueKey(str, obj);
+ break;
case T_PathTarget:
_outPathTarget(str, obj);
break;
diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index 4ecde6b421..62c9d4ef7a 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable)
printf(")\n");
}
+/*
+ * print_uniquekeys -
+ *	  print a list of UniqueKeys (for debugging)
+ */
+void
+print_uniquekeys(const List *uniquekeys, const List *rtable)
+{
+ ListCell *l;
+
+ printf("(");
+ foreach(l, uniquekeys)
+ {
+ UniqueKey *unique_key = (UniqueKey *) lfirst(l);
+ EquivalenceClass *eclass = (EquivalenceClass *) unique_key->eq_clause;
+ ListCell *k;
+ bool first = true;
+
+ /* chase up */
+ while (eclass->ec_merged)
+ eclass = eclass->ec_merged;
+
+ printf("(");
+ foreach(k, eclass->ec_members)
+ {
+ EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
+
+ if (first)
+ first = false;
+ else
+ printf(", ");
+ print_expr((Node *) mem->em_expr, rtable);
+ }
+ printf(")");
+ if (lnext(uniquekeys, l))
+ printf(", ");
+ }
+ printf(")\n");
+}
+
/*
* print_tl
* print targetlist in a more legible way.
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..63cc1505d9 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
joinpath.o \
joinrels.o \
pathkeys.o \
- tidpath.o
+ tidpath.o \
+ uniquekey.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index db3a68a51d..5fc9b81746 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3954,6 +3954,14 @@ print_path(PlannerInfo *root, Path *path, int indent)
print_pathkeys(path->pathkeys, root->parse->rtable);
}
+ if (path->uniquekeys)
+ {
+ for (i = 0; i < indent; i++)
+ printf("\t");
+ printf(" uniquekeys: ");
+ print_uniquekeys(path->uniquekeys, root->parse->rtable);
+ }
+
if (join)
{
JoinPath *jp = (JoinPath *) path;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c5f6593485..0ec9a6db76 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -705,6 +705,11 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
path->path.parallel_aware = true;
}
+ /* Consider cost based on unique key */
+ if (path->path.uniquekeys)
+ {
+ }
+
/*
* Now interpolate based on estimated index order correlation to get total
* disk I/O cost for main table accesses.
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 37b257cd0e..aa0da49119 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -189,6 +189,7 @@ static Expr *match_clause_to_ordering_op(IndexOptInfo *index,
static bool ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
+static List *get_uniquekeys_for_index(PlannerInfo *root, List *pathkeys);
/*
@@ -874,6 +875,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
List *orderbyclausecols;
List *index_pathkeys;
List *useful_pathkeys;
+ List *useful_uniquekeys = NIL;
bool found_lower_saop_clause;
bool pathkeys_possibly_useful;
bool index_is_ordered;
@@ -1036,11 +1038,15 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
if (index_clauses != NIL || useful_pathkeys != NIL || useful_predicate ||
index_only_scan)
{
+ if (has_useful_uniquekeys(root))
+ useful_uniquekeys = get_uniquekeys_for_index(root, useful_pathkeys);
+
ipath = create_index_path(root, index,
index_clauses,
orderbyclauses,
orderbyclausecols,
useful_pathkeys,
+ useful_uniquekeys,
index_is_ordered ?
ForwardScanDirection :
NoMovementScanDirection,
@@ -1063,6 +1069,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
orderbyclauses,
orderbyclausecols,
useful_pathkeys,
+ useful_uniquekeys,
index_is_ordered ?
ForwardScanDirection :
NoMovementScanDirection,
@@ -1093,11 +1100,15 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
index_pathkeys);
if (useful_pathkeys != NIL)
{
+ if (has_useful_uniquekeys(root))
+ useful_uniquekeys = get_uniquekeys_for_index(root, useful_pathkeys);
+
ipath = create_index_path(root, index,
index_clauses,
NIL,
NIL,
useful_pathkeys,
+ useful_uniquekeys,
BackwardScanDirection,
index_only_scan,
outer_relids,
@@ -1115,6 +1126,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
NIL,
NIL,
useful_pathkeys,
+ useful_uniquekeys,
BackwardScanDirection,
index_only_scan,
outer_relids,
@@ -3365,6 +3377,35 @@ match_clause_to_ordering_op(IndexOptInfo *index,
return clause;
}
+/*
+ * get_uniquekeys_for_index
+ */
+static List *
+get_uniquekeys_for_index(PlannerInfo *root, List *pathkeys)
+{
+ ListCell *lc;
+
+ if (pathkeys)
+ {
+ List *uniquekeys = NIL;
+ foreach(lc, pathkeys)
+ {
+ UniqueKey *unique_key;
+ PathKey *pk = (PathKey *) lfirst(lc);
+ EquivalenceClass *ec = (EquivalenceClass *) pk->pk_eclass;
+
+ unique_key = makeNode(UniqueKey);
+ unique_key->eq_clause = ec;
+
+ uniquekeys = lappend(uniquekeys, unique_key);
+ }
+
+ if (uniquekeys_contained_in(root->canon_uniquekeys, uniquekeys))
+ return uniquekeys;
+ }
+
+ return NIL;
+}
/****************************************************************************
* ---- ROUTINES TO DO PARTIAL INDEX PREDICATE TESTS ----
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 2f4fea241a..b0dc1bc22a 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
#include "utils/lsyscache.h"
+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
RelOptInfo *partrel,
@@ -96,6 +97,29 @@ make_canonical_pathkey(PlannerInfo *root,
return pk;
}
+/*
+ * pathkey_is_unique
+ * Checks if the new pathkey's equivalence class is the same as that of
+ * any existing member of the pathkey list.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+ EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+ ListCell *lc;
+
+ /* If the same EC is already in the list, then not unique */
+ foreach(lc, pathkeys)
+ {
+ PathKey *old_pathkey = (PathKey *) lfirst(lc);
+
+ if (new_ec == old_pathkey->pk_eclass)
+ return false;
+ }
+
+ return true;
+}
+
/*
* pathkey_is_redundant
* Is a pathkey redundant with one already in the given list?
@@ -135,22 +159,12 @@ static bool
pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys)
{
EquivalenceClass *new_ec = new_pathkey->pk_eclass;
- ListCell *lc;
/* Check for EC containing a constant --- unconditionally redundant */
if (EC_MUST_BE_REDUNDANT(new_ec))
return true;
- /* If same EC already used in list, then redundant */
- foreach(lc, pathkeys)
- {
- PathKey *old_pathkey = (PathKey *) lfirst(lc);
-
- if (new_ec == old_pathkey->pk_eclass)
- return true;
- }
-
- return false;
+ return !pathkey_is_unique(new_pathkey, pathkeys);
}
/*
@@ -1098,6 +1112,41 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
return pathkeys;
}
+/*
+ * make_pathkeys_for_uniquekeyclauses
+ * Generate a pathkeys list to be used for uniquekey clauses
+ */
+List *
+make_pathkeys_for_uniquekeys(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist)
+{
+ List *pathkeys = NIL;
+ ListCell *l;
+
+ foreach(l, sortclauses)
+ {
+ SortGroupClause *sortcl = (SortGroupClause *) lfirst(l);
+ Expr *sortkey;
+ PathKey *pathkey;
+
+ sortkey = (Expr *) get_sortgroupclause_expr(sortcl, tlist);
+ Assert(OidIsValid(sortcl->sortop));
+ pathkey = make_pathkey_from_sortop(root,
+ sortkey,
+ root->nullable_baserels,
+ sortcl->sortop,
+ sortcl->nulls_first,
+ sortcl->tleSortGroupRef,
+ true);
+
+ if (pathkey_is_unique(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+
+ return pathkeys;
+}
+
/****************************************************************************
* PATHKEYS AND MERGECLAUSES
****************************************************************************/
diff --git a/src/backend/optimizer/path/uniquekey.c b/src/backend/optimizer/path/uniquekey.c
new file mode 100644
index 0000000000..13d4ebb98c
--- /dev/null
+++ b/src/backend/optimizer/path/uniquekey.c
@@ -0,0 +1,147 @@
+/*-------------------------------------------------------------------------
+ *
+ * uniquekey.c
+ * Utilities for matching and building unique keys
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/optimizer/path/uniquekey.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "nodes/pg_list.h"
+
+static UniqueKey *make_canonical_uniquekey(PlannerInfo *root, EquivalenceClass *eclass);
+
+/*
+ * Build a list of unique keys
+ */
+List*
+build_uniquekeys(PlannerInfo *root, List *sortclauses)
+{
+ List *result = NIL;
+ List *sortkeys;
+ ListCell *l;
+
+ sortkeys = make_pathkeys_for_uniquekeys(root,
+ sortclauses,
+ root->processed_tlist);
+
+ /* Create a uniquekey and add it to the list */
+ foreach(l, sortkeys)
+ {
+ PathKey *pathkey = (PathKey *) lfirst(l);
+ EquivalenceClass *ec = pathkey->pk_eclass;
+ UniqueKey *unique_key = make_canonical_uniquekey(root, ec);
+
+ result = lappend(result, unique_key);
+ }
+
+ return result;
+}
+
+/*
+ * uniquekeys_contained_in
+ *	  Are all keys in keys2 also contained in keys1?
+ */
+bool
+uniquekeys_contained_in(List *keys1, List *keys2)
+{
+ ListCell *key1,
+ *key2;
+
+ /*
+ * Fall out quickly if we are passed two identical lists. This mostly
+ * catches the case where both are NIL, but that's common enough to
+ * warrant the test.
+ */
+ if (keys1 == keys2)
+ return true;
+
+ foreach(key2, keys2)
+ {
+ bool found = false;
+ UniqueKey *uniquekey2 = (UniqueKey *) lfirst(key2);
+
+ foreach(key1, keys1)
+ {
+ UniqueKey *uniquekey1 = (UniqueKey *) lfirst(key1);
+
+ if (uniquekey1->eq_clause == uniquekey2->eq_clause)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * has_useful_uniquekeys
+ * Detect whether the planner could have any uniquekeys that are
+ * useful.
+ */
+bool
+has_useful_uniquekeys(PlannerInfo *root)
+{
+ if (root->query_uniquekeys != NIL)
+ return true; /* there are some */
+ return false; /* definitely useless */
+}
+
+/*
+ * make_canonical_uniquekey
+ * Given the parameters for a UniqueKey, find any pre-existing matching
+ * uniquekey in the query's list of "canonical" uniquekeys. Make a new
+ * entry if there's not one already.
+ *
+ * Note that this function must not be used until after we have completed
+ * merging EquivalenceClasses. (We don't try to enforce that here; instead,
+ * equivclass.c will complain if a merge occurs after root->canon_uniquekeys
+ * has become nonempty.)
+ */
+static UniqueKey *
+make_canonical_uniquekey(PlannerInfo *root,
+ EquivalenceClass *eclass)
+{
+ UniqueKey *uk;
+ ListCell *lc;
+ MemoryContext oldcontext;
+
+ /* The passed eclass might be non-canonical, so chase up to the top */
+ while (eclass->ec_merged)
+ eclass = eclass->ec_merged;
+
+ foreach(lc, root->canon_uniquekeys)
+ {
+ uk = (UniqueKey *) lfirst(lc);
+ if (eclass == uk->eq_clause)
+ return uk;
+ }
+
+ /*
+ * Be sure canonical uniquekeys are allocated in the main planning context.
+ * Not an issue in normal planning, but it is for GEQO.
+ */
+ oldcontext = MemoryContextSwitchTo(root->planner_cxt);
+
+ uk = makeNode(UniqueKey);
+ uk->eq_clause = eclass;
+
+ root->canon_uniquekeys = lappend(root->canon_uniquekeys, uk);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return uk;
+}
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 9381939c82..3d32f6dfd6 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -512,6 +512,7 @@ minmax_qp_callback(PlannerInfo *root, void *extra)
root->parse->targetList);
root->query_pathkeys = root->sort_pathkeys;
+ root->query_uniquekeys = NIL;
}
/*
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
index f0c1b52a2e..3ccde03ab7 100644
--- a/src/backend/optimizer/plan/planmain.c
+++ b/src/backend/optimizer/plan/planmain.c
@@ -70,6 +70,7 @@ query_planner(PlannerInfo *root,
root->join_rel_level = NULL;
root->join_cur_level = 0;
root->canon_pathkeys = NIL;
+ root->canon_uniquekeys = NIL;
root->left_join_clauses = NIL;
root->right_join_clauses = NIL;
root->full_join_clauses = NIL;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 17c5f086fb..2507ec7d2a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3652,15 +3652,30 @@ standard_qp_callback(PlannerInfo *root, void *extra)
* much easier, since we know that the parser ensured that one is a
* superset of the other.
*/
+ root->query_uniquekeys = NIL;
+
if (root->group_pathkeys)
+ {
root->query_pathkeys = root->group_pathkeys;
+
+ if (!root->parse->hasAggs)
+ root->query_uniquekeys = build_uniquekeys(root, qp_extra->groupClause);
+ }
else if (root->window_pathkeys)
root->query_pathkeys = root->window_pathkeys;
else if (list_length(root->distinct_pathkeys) >
list_length(root->sort_pathkeys))
+ {
root->query_pathkeys = root->distinct_pathkeys;
+ root->query_uniquekeys = build_uniquekeys(root, parse->distinctClause);
+ }
else if (root->sort_pathkeys)
+ {
root->query_pathkeys = root->sort_pathkeys;
+
+ if (root->distinct_pathkeys)
+ root->query_uniquekeys = build_uniquekeys(root, parse->distinctClause);
+ }
else
root->query_pathkeys = NIL;
}
@@ -6217,7 +6232,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
/* Estimate the cost of index scan */
indexScanPath = create_index_path(root, indexInfo,
- NIL, NIL, NIL, NIL,
+ NIL, NIL, NIL, NIL, NIL,
ForwardScanDirection, false,
NULL, 1.0, false);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 34acb732ee..f268112386 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -941,6 +941,7 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = parallel_workers;
pathnode->pathkeys = NIL; /* seqscan has unordered result */
+ pathnode->uniquekeys = NIL;
cost_seqscan(pathnode, root, rel, pathnode->param_info);
@@ -965,6 +966,7 @@ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* samplescan has unordered result */
+ pathnode->uniquekeys = NIL;
cost_samplescan(pathnode, root, rel, pathnode->param_info);
@@ -1001,6 +1003,7 @@ create_index_path(PlannerInfo *root,
List *indexorderbys,
List *indexorderbycols,
List *pathkeys,
+ List *uniquekeys,
ScanDirection indexscandir,
bool indexonly,
Relids required_outer,
@@ -1019,6 +1022,7 @@ create_index_path(PlannerInfo *root,
pathnode->path.parallel_safe = rel->consider_parallel;
pathnode->path.parallel_workers = 0;
pathnode->path.pathkeys = pathkeys;
+ pathnode->path.uniquekeys = uniquekeys;
pathnode->indexinfo = index;
pathnode->indexclauses = indexclauses;
@@ -1062,6 +1066,7 @@ create_bitmap_heap_path(PlannerInfo *root,
pathnode->path.parallel_safe = rel->consider_parallel;
pathnode->path.parallel_workers = parallel_degree;
pathnode->path.pathkeys = NIL; /* always unordered */
+ pathnode->path.uniquekeys = NIL;
pathnode->bitmapqual = bitmapqual;
@@ -1923,6 +1928,7 @@ create_functionscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = pathkeys;
+ pathnode->uniquekeys = NIL;
cost_functionscan(pathnode, root, rel, pathnode->param_info);
@@ -1949,6 +1955,7 @@ create_tablefuncscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->uniquekeys = NIL;
cost_tablefuncscan(pathnode, root, rel, pathnode->param_info);
@@ -1975,6 +1982,7 @@ create_valuesscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->uniquekeys = NIL;
cost_valuesscan(pathnode, root, rel, pathnode->param_info);
@@ -2000,6 +2008,7 @@ create_ctescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* XXX for now, result is always unordered */
+ pathnode->uniquekeys = NIL;
cost_ctescan(pathnode, root, rel, pathnode->param_info);
@@ -2026,6 +2035,7 @@ create_namedtuplestorescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->uniquekeys = NIL;
cost_namedtuplestorescan(pathnode, root, rel, pathnode->param_info);
@@ -2052,6 +2062,7 @@ create_resultscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->uniquekeys = NIL;
cost_resultscan(pathnode, root, rel, pathnode->param_info);
@@ -2078,6 +2089,7 @@ create_worktablescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->uniquekeys = NIL;
/* Cost is the same as for a regular CTE scan */
cost_ctescan(pathnode, root, rel, pathnode->param_info);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index bce2d59b0d..cbb6ba2586 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -261,6 +261,7 @@ typedef enum NodeTag
T_EquivalenceMember,
T_PathKey,
T_PathTarget,
+ T_UniqueKey,
T_RestrictInfo,
T_IndexClause,
T_PlaceHolderVar,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 23a06d718e..6198c31cd4 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -267,6 +267,8 @@ struct PlannerInfo
List *canon_pathkeys; /* list of "canonical" PathKeys */
+ List *canon_uniquekeys; /* list of "canonical" UniqueKeys */
+
List *left_join_clauses; /* list of RestrictInfos for mergejoinable
* outer join clauses w/nonnullable var on
* left */
@@ -295,6 +297,8 @@ struct PlannerInfo
List *query_pathkeys; /* desired pathkeys for query_planner() */
+ List *query_uniquekeys; /* desired unique keys for query_planner() */
+
List *group_pathkeys; /* groupClause pathkeys, if any */
List *window_pathkeys; /* pathkeys of bottom window, if any */
List *distinct_pathkeys; /* distinctClause pathkeys, if any */
@@ -1075,6 +1079,15 @@ typedef struct ParamPathInfo
List *ppi_clauses; /* join clauses available from outer rels */
} ParamPathInfo;
+/*
+ * UniqueKey
+ */
+typedef struct UniqueKey
+{
+ NodeTag type;
+
+ EquivalenceClass *eq_clause; /* equivalence class */
+} UniqueKey;
/*
* Type "Path" is used as-is for sequential-scan paths, as well as some other
@@ -1104,6 +1117,9 @@ typedef struct ParamPathInfo
*
* "pathkeys" is a List of PathKey nodes (see above), describing the sort
* ordering of the path's output rows.
+ *
+ * "uniquekeys", if not NIL, is a list of UniqueKey nodes (see above),
+ * describing the expressions the path's output rows are known to be unique on.
*/
typedef struct Path
{
@@ -1127,6 +1143,8 @@ typedef struct Path
List *pathkeys; /* sort ordering of path's output */
/* pathkeys is a List of PathKey nodes; see above */
+
+ List *uniquekeys; /* the unique keys, or NIL if none */
} Path;
/* Macro for extracting a path's parameterization relids; beware double eval */
diff --git a/src/include/nodes/print.h b/src/include/nodes/print.h
index cbff56a724..f1a7112877 100644
--- a/src/include/nodes/print.h
+++ b/src/include/nodes/print.h
@@ -28,6 +28,7 @@ extern char *pretty_format_node_dump(const char *dump);
extern void print_rt(const List *rtable);
extern void print_expr(const Node *expr, const List *rtable);
extern void print_pathkeys(const List *pathkeys, const List *rtable);
+extern void print_uniquekeys(const List *uniquekeys, const List *rtable);
extern void print_tl(const List *tlist, const List *rtable);
extern void print_slot(TupleTableSlot *slot);
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index a12af54971..37a946f857 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -44,6 +44,7 @@ extern IndexPath *create_index_path(PlannerInfo *root,
List *indexorderbys,
List *indexorderbycols,
List *pathkeys,
+ List *uniquekeys,
ScanDirection indexscandir,
bool indexonly,
Relids required_outer,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index c6c34630c2..c79e47eeaf 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -214,6 +214,9 @@ extern List *build_join_pathkeys(PlannerInfo *root,
extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
List *sortclauses,
List *tlist);
+extern List *make_pathkeys_for_uniquekeys(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist);
extern void initialize_mergeclause_eclasses(PlannerInfo *root,
RestrictInfo *restrictinfo);
extern void update_mergeclause_eclasses(PlannerInfo *root,
@@ -240,4 +243,12 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *live_childrels);
+/*
+ * uniquekey.c
+ * Utilities for matching and building unique keys
+ */
+extern List *build_uniquekeys(PlannerInfo *root, List *sortclauses);
+extern bool uniquekeys_contained_in(List *keys1, List *keys2);
+extern bool has_useful_uniquekeys(PlannerInfo *root);
+
#endif /* PATHS_H */
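As a reading aid for the new uniquekey.c entry points above, here is a toy sketch in plain C of what a containment check in the spirit of uniquekeys_contained_in() could look like; ints stand in for EquivalenceClass pointers, and the "every key of keys1 also appears in keys2" semantics is my assumption, not taken from the patch:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical sketch (not PostgreSQL code): keys1 is "contained in"
 * keys2 when every key of keys1 also appears somewhere in keys2.
 * Ints stand in for EquivalenceClass pointers.
 */
bool
toy_uniquekeys_contained_in(const int *keys1, size_t n1,
                            const int *keys2, size_t n2)
{
    for (size_t i = 0; i < n1; i++)
    {
        bool found = false;

        for (size_t j = 0; j < n2; j++)
        {
            if (keys1[i] == keys2[j])
            {
                found = true;
                break;
            }
        }
        if (!found)
            return false;
    }
    return true;            /* vacuously true for an empty keys1 */
}
```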
--
2.21.0
Attachment: v29_0002-Index-skip-scan.patch (text/x-patch)
From 8c0b71051757e06dc312f20d3ffda0d7545d9b2b Mon Sep 17 00:00:00 2001
From: jesperpedersen <jesper.pedersen@redhat.com>
Date: Mon, 11 Nov 2019 09:06:31 -0500
Subject: [PATCH 2/2] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan and IndexScan. To make it suitable both for
situations with a small number of distinct values and for those with a
significant number of them, the following approach is taken: instead of
searching from the root for every value we're looking for, we search
first on the current page, and only if the value is not found there do we
continue searching from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
Author: Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Floris Van Nee, Kyotaro Horiguchi, Tomas Vondra, Peter Geoghegan
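The search strategy described above (probe the current page first, fall back to a descent from the root) can be modelled in a few lines of self-contained C: a sorted array plays the index, a small fixed window plays the "current page", and a binary search plays the descent from the root. This is an illustration only, none of it is PostgreSQL code:

```c
#include <stddef.h>

#define PAGE_WINDOW 4           /* stand-in for "the current page" */

/* first index in [lo, n) with a[i] > key, via binary search ("from the root") */
static size_t
toy_upper_bound(const int *a, size_t lo, size_t n, int key)
{
    size_t hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (a[mid] <= key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Collect the distinct values of sorted a[0..n) into out; return the count. */
size_t
skip_distinct(const int *a, size_t n, int *out)
{
    size_t ndistinct = 0;
    size_t pos = 0;

    while (pos < n)
    {
        size_t i;
        int cur = a[pos];

        out[ndistinct++] = cur;

        /* cheap case: look for the next distinct value on the "current page" */
        for (i = pos + 1; i < n && i < pos + PAGE_WINDOW; i++)
            if (a[i] > cur)
                break;

        if (i < n && i < pos + PAGE_WINDOW && a[i] > cur)
            pos = i;            /* found without leaving the page */
        else if (i >= n)
            break;              /* remaining entries are all duplicates */
        else
            pos = toy_upper_bound(a, i, n, cur);    /* search "from the root" */
    }
    return ndistinct;
}
```

The cost is proportional to the number of distinct values rather than the number of entries, which is exactly the effect shown in the EXPLAIN output at the top of the thread.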
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 15 +
doc/src/sgml/indexam.sgml | 63 +++
doc/src/sgml/indices.sgml | 23 +
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 18 +
src/backend/access/nbtree/nbtree.c | 13 +
src/backend/access/nbtree/nbtsearch.c | 363 +++++++++++++
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 25 +
src/backend/executor/nodeIndexonlyscan.c | 51 +-
src/backend/executor/nodeIndexscan.c | 51 +-
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 2 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/createplan.c | 20 +-
src/backend/optimizer/plan/planner.c | 76 +++
src/backend/optimizer/util/pathnode.c | 40 ++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 8 +
src/include/access/genam.h | 2 +
src/include/access/nbtree.h | 7 +
src/include/access/sdir.h | 7 +
src/include/nodes/execnodes.h | 6 +
src/include/nodes/pathnodes.h | 5 +
src/include/nodes/plannodes.h | 4 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/test/regress/expected/select_distinct.out | 505 ++++++++++++++++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/select_distinct.sql | 186 +++++++
37 files changed, 1510 insertions(+), 11 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index e2063bac62..bc3cf8e7fe 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f83770350e..b3a96af1f5 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4478,6 +4478,21 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). The default is
+ <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index dd54c68802..94e09835b4 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -144,6 +144,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -687,6 +688,68 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan,
+ ScanDirection direction,
+ ScanDirection indexdir,
+ bool scanstart,
+ int prefix);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan. The arguments are:
+
+ <variablelist>
+ <varlistentry>
+ <term><parameter>scan</parameter></term>
+ <listitem>
+ <para>
+ Index scan information
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>direction</parameter></term>
+ <listitem>
+ <para>
+ The direction in which data is advancing.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>indexdir</parameter></term>
+ <listitem>
+ <para>
+ The index direction, in which data must be read.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>scanstart</parameter></term>
+ <listitem>
+ <para>
+ Whether or not this is the start of the scan.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>prefix</parameter></term>
+ <listitem>
+ <para>
+ Distinct prefix size.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 95c0a1926c..efc5e41389 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1235,6 +1235,29 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When the rows retrieved from an index scan are then deduplicated by
+ eliminating rows matching on a prefix of index keys (e.g. when using
+ <literal>SELECT DISTINCT</literal>), the planner will consider
+ skipping groups of rows with a matching key prefix. When a row with
+ a particular prefix is found, remaining rows with the same key prefix
+ are skipped. The larger the number of rows with the same key prefix
+ (i.e. the lower the number of distinct key prefixes in the index),
+ the more efficient this is.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 294ffa6e20..58919ca708 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -109,6 +109,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 38593554f0..eaaf7db78d 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0cc87911d6..38072ad24b 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -83,6 +83,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 6c058362bd..9e5dbf6097 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -82,6 +82,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 9dfa0ddfbb..237efb86a2 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -730,6 +731,23 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool scanstart, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction,
+ indexdir, scanstart, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..46471598d1 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -131,6 +131,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -380,6 +381,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -447,6 +450,16 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool start, int prefix)
+{
+ return _bt_skip(scan, direction, indexdir, start, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..9e9d5c77c3 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -37,6 +37,10 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+static inline void _bt_update_skip_scankeys(IndexScanDesc scan,
+ Relation indexRel);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir);
/*
@@ -1375,6 +1379,309 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple.
+ *
+ * The current position is set so that a subsequent call to _bt_next will
+ * fetch the first tuple that differs in the leading 'prefix' keys.
+ *
+ * There are four different kinds of skipping (depending on dir and
+ * indexdir) that are important to distinguish, especially in the presence
+ * of an index condition:
+ *
+ * * Advancing forward and reading forward
+ * simple scan
+ *
+ * * Advancing forward and reading backward
+ * scan inside a cursor fetching backward, when skipping is necessary
+ * right from the start
+ *
+ * * Advancing backward and reading forward
+ * scan with order by desc inside a cursor fetching forward, when
+ * skipping is necessary right from the start
+ *
+ * * Advancing backward and reading backward
+ * simple scan with order by desc
+ *
+ * The current page is searched for the next unique value. If none is found
+ * we will do a scan from the root in order to find the next page with
+ * a unique value.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool scanstart, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Relation indexRel = scan->indexRelation;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL then we initialize it with _bt_mkscankey,
+ * otherwise we will just update the sk_flags / sk_argument elements in
+ * order to eliminate repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ so->skipScanKey->keysz = prefix;
+ so->skipScanKey->scantid = NULL;
+ }
+ else
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+ }
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos) &&
+ _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {
+ bool keyFound = false;
+
+ LockBuffer(so->currPos.buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, so->currPos.buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(so->currPos.buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found scan key within the current page, so let's scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+ * number
+ */
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ /*
+ * Simplest case is when both directions are forward, when we are already
+ * at the next distinct key at the beginning of the series (so everything
+ * else would be done in _bt_readpage)
+ *
+ * The case when both directions are backwards is also simple, but we need
+ * to go one step back, since we need a last element from the previous
+ * series.
+ */
+ if (ScanDirectionIsBackward(dir) && ScanDirectionIsBackward(indexdir))
+ offnum = OffsetNumberPrev(offnum);
+
+ /*
+ * Advance backward but read forward. At this moment we are at the next
+ * distinct key at the beginning of the series. If the scan has just
+ * started, we can read forward without doing anything else. Otherwise
+ * find the previous distinct key and the beginning of its series, and
+ * read forward from there. To do so, go back one step, perform a binary
+ * search to find the first item in the series, and let _bt_readpage do
+ * everything else.
+ */
+ else if (ScanDirectionIsBackward(dir) && ScanDirectionIsForward(indexdir))
+ {
+ if (!scanstart)
+ {
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* One step back to find a previous value */
+ _bt_readpage(scan, dir, offnum);
+
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (_bt_next(scan, dir))
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ /*
+ * And now find the last item of the sequence for the current
+ * value, with the intention of doing OffsetNumberNext. As a
+ * result we end up on the first element of the sequence.
+ */
+ if (_bt_scankey_within_page(scan, so->skipScanKey,
+ so->currPos.buf, dir))
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ else
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ }
+ else
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ }
+
+ /*
+ * Advance forward but read backward. At this moment we are at the next
+ * distinct key at the beginning of the series. If the scan has just
+ * started, we can go one step back and read forward without doing
+ * anything else. Otherwise find the next distinct key and the beginning
+ * of its series, go one step back and read backward from there.
+ *
+ * An interesting situation can happen if one of the distinct keys does not
+ * pass the corresponding index condition at all. In this case reading
+ * backward can lead to a previous distinct key being found, creating a
+ * loop. To avoid that, check the value to be returned, and jump one more
+ * time if it's the same as at the beginning.
+ */
+ else if (ScanDirectionIsForward(dir) && ScanDirectionIsBackward(indexdir))
+ {
+ if (scanstart)
+ offnum = OffsetNumberPrev(offnum);
+ else
+ {
+ OffsetNumber nextOffset,
+ startOffset;
+
+ nextOffset = startOffset = ItemPointerGetOffsetNumber(&scan->xs_itup->t_tid);
+
+ while (nextOffset == startOffset)
+ {
+ /*
+ * Find the next index tuple to update the scan key. It could
+ * be at the end, so check for the max offset
+ */
+ OffsetNumber curOffnum = offnum;
+ Page page = BufferGetPage(so->currPos.buf);
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ ItemId itemid = PageGetItemId(page, Min(offnum, maxoff));
+
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ scan->xs_itup = (IndexTuple) PageGetItem(page, itemid);
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /*
+ * The jump to the next key returned the same offset, which
+ * means we are at the end and need to return
+ */
+ if (offnum == curOffnum)
+ {
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+
+ BTScanPosUnpinIfPinned(so->currPos);
+ BTScanPosInvalidate(so->currPos);
+
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+
+ /* Check if _bt_readpage returns already found item */
+ if (_bt_readpage(scan, indexdir, offnum))
+ {
+ IndexTuple itup;
+
+ currItem = &so->currPos.items[so->currPos.lastItem];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
+ }
+ else
+ elog(ERROR, "could not read closest index tuples: %d", offnum);
+
+ /*
+ * If nextOffset is the same as before, it means we are in a
+ * loop; return offnum to the original position and jump
+ * further
+ */
+ if (nextOffset == startOffset)
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, indexdir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
@@ -2246,3 +2553,59 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
so->numKilled = 0; /* just paranoia */
so->markItemIndex = -1; /* ditto */
}
+
+/*
+ * _bt_update_skip_scankeys() -- set up new values for the existing scankeys
+ * based on the current index tuple
+ */
+static inline void
+_bt_update_skip_scankeys(IndexScanDesc scan, Relation indexRel)
+{
+ TupleDesc itupdesc;
+ int indnkeyatts,
+ i;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey scankeys = so->skipScanKey->scankeys;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+}
+
+/*
+ * _bt_scankey_within_page() -- check if the provided scankey could be found
+ * within the page held in the given buffer.
+ */
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir)
+{
+ OffsetNumber low,
+ high,
+ compare_offset;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+
+ if (unlikely(high < low))
+ return false;
+
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ return _bt_compare(scan->indexRelation,
+ key, page, compare_offset) > compare_value;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 45472db147..dc151ecf09 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 62fb3434a3..b66296d6c9 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -130,6 +130,7 @@ static void ExplainDummyGroup(const char *objtype, const char *labelname,
static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
static void ExplainJSONLineEnding(ExplainState *es);
static void ExplainYAMLLineStarting(ExplainState *es);
+static void ExplainIndexSkipScanKeys(int skipPrefixSize, ExplainState *es);
static void escape_yaml(StringInfo buf, const char *str);
@@ -1041,6 +1042,22 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
return planstate_tree_walker(planstate, ExplainPreScanNode, rels_used);
}
+/*
+ * ExplainIndexSkipScanKeys -
+ * Append information about index skip scan to es->str.
+ *
+ * Can be used to print the skip prefix size.
+ */
+static void
+ExplainIndexSkipScanKeys(int skipPrefixSize, ExplainState *es)
+{
+ if (skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL, skipPrefixSize, es);
+ }
+}
+
/*
* ExplainNode -
* Appends a description of a plan tree to es->str
@@ -1363,6 +1380,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexScan *indexscan = (IndexScan *) plan;
+ ExplainIndexSkipScanKeys(indexscan->indexskipprefixsize, es);
+
ExplainIndexScanDetails(indexscan->indexid,
indexscan->indexorderdir,
es);
@@ -1373,6 +1392,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ ExplainIndexSkipScanKeys(indexonlyscan->indexskipprefixsize, es);
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1582,6 +1603,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
switch (nodeTag(plan))
{
case T_IndexScan:
+ if (((IndexScan *) plan)->indexskipprefixsize > 0)
+ ExplainPropertyBool("Skip scan", true, es);
show_scan_qual(((IndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
if (((IndexScan *) plan)->indexqualorig)
@@ -1595,6 +1618,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->indexskipprefixsize > 0)
+ ExplainPropertyBool("Skip scan", true, es);
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 784486f0c8..985fc3c50f 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -65,6 +65,13 @@ IndexOnlyNext(IndexOnlyScanState *node)
IndexScanDesc scandesc;
TupleTableSlot *slot;
ItemPointer tid;
+ IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) node->ss.ps.plan;
+
+ /*
+ * Tells whether the current position was reached via skipping, in which
+ * case there is no need for index_getnext_tid.
+ */
+ bool skipped = false;
/*
* extract necessary information from index scan node
@@ -72,7 +79,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexonlyscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -115,14 +122,50 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ *
+ * When fetching a cursor in the direction opposite to a general scan
+ * direction, the result must be what normal fetching should have
+ * returned, but in reversed order. In other words, return the last or
+ * first scanned tuple in a DISTINCT set, depending on a cursor direction.
+ * Due to that we skip also when the first tuple wasn't emitted yet, but
+ * the directions are opposite.
+ */
+ if (node->ioss_SkipPrefixSize > 0 &&
+ (node->ioss_FirstTupleEmitted ||
+ ScanDirectionsAreOpposite(direction, indexonlyscan->indexorderdir)))
+ {
+ if (!index_skip(scandesc, direction, indexonlyscan->indexorderdir,
+ !node->ioss_FirstTupleEmitted, node->ioss_SkipPrefixSize))
+ {
+ /*
+ * Reached end of index. At this point currPos is invalidated, and
+ * we need to reset ioss_FirstTupleEmitted, since otherwise after
+ * going backwards, reaching the end of index, and going forward
+ * again we apply skip again. It would be incorrect and lead to an
+ * extra skipped item.
+ */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ else
+ {
+ skipped = true;
+ tid = &scandesc->xs_heaptid;
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
- while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+ while (skipped || (tid = index_getnext_tid(scandesc, direction)) != NULL)
{
bool tuple_from_heap = false;
CHECK_FOR_INTERRUPTS();
+ skipped = false;
/*
* We can skip the heap fetch if the TID references a heap page on
@@ -250,6 +293,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -504,6 +549,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
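The control flow these executor hunks add — skip only once a tuple has been emitted, reuse the position the skip landed on instead of fetching again, and reset the first-tuple flag when the end of the index is reached — can be sketched in miniature C, with arrays and ints standing in for the scan machinery (an illustration of the flow only, not PostgreSQL code):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyScan
{
    const int  *vals;           /* sorted "index" entries */
    size_t      n;
    size_t      pos;            /* next position to fetch */
} ToyScan;

/* next value, or -1 at end (stand-in for index_getnext_tid) */
static int
toy_getnext(ToyScan *s)
{
    return (s->pos < s->n) ? s->vals[s->pos++] : -1;
}

/* stand-in for index_skip(): advance past duplicates of the last value;
 * returns false at end of "index" */
static bool
toy_skip(ToyScan *s)
{
    int last = s->vals[s->pos - 1];

    while (s->pos < s->n && s->vals[s->pos] == last)
        s->pos++;
    if (s->pos >= s->n)
        return false;
    s->pos++;                   /* position on the next distinct value */
    return true;
}

/* One call = one tuple, like IndexOnlyNext(): skip first once a tuple
 * was already emitted, and reuse the skipped-to tuple instead of
 * fetching again.  Returns -1 when the scan is exhausted. */
int
toy_next_distinct(ToyScan *s, bool *first_emitted)
{
    bool skipped = false;
    int val = -1;

    if (*first_emitted)
    {
        if (!toy_skip(s))
        {
            *first_emitted = false; /* reset at end, as the patch does */
            return -1;
        }
        skipped = true;
        val = s->vals[s->pos - 1];
    }

    if (!skipped)
        val = toy_getnext(s);
    if (val == -1)
        return -1;

    *first_emitted = true;
    return val;
}
```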
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index c06d07aa46..3e82fa37c7 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -85,6 +85,13 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ IndexScan *indexscan = (IndexScan *) node->ss.ps.plan;
+
+ /*
+ * Tells whether the current position was reached via skipping, in which
+ * case there is no need for index_getnext_slot.
+ */
+ bool skipped = false;
/*
* extract necessary information from index scan node
@@ -92,7 +99,7 @@ IndexNext(IndexScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -116,6 +123,7 @@ IndexNext(IndexScanState *node)
node->iss_NumOrderByKeys);
node->iss_ScanDesc = scandesc;
+ node->iss_ScanDesc->xs_want_itup = true;
/*
* If no run-time keys to calculate or they are ready, go ahead and
@@ -127,12 +135,48 @@ IndexNext(IndexScanState *node)
node->iss_OrderByKeys, node->iss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ *
+ * When fetching a cursor in the direction opposite to the general scan
+ * direction, the result must be what normal fetching would have
+ * returned, but in reversed order. In other words, return the last or
+ * first scanned tuple in a DISTINCT set, depending on the cursor
+ * direction. Because of that, we also skip when the first tuple hasn't
+ * been emitted yet but the directions are opposite.
+ */
+ if (node->iss_SkipPrefixSize > 0 &&
+ (node->iss_FirstTupleEmitted ||
+ ScanDirectionsAreOpposite(direction, indexscan->indexorderdir)))
+ {
+ if (!index_skip(scandesc, direction, indexscan->indexorderdir,
+ !node->iss_FirstTupleEmitted, node->iss_SkipPrefixSize))
+ {
+ /*
+ * Reached the end of the index. At this point currPos is invalidated,
+ * and we need to reset iss_FirstTupleEmitted, since otherwise, after
+ * going backwards, reaching the end of the index, and going forward
+ * again, we would apply the skip again. That would be incorrect and
+ * lead to an extra skipped item.
+ node->iss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ else
+ {
+ skipped = true;
+ index_fetch_heap(scandesc, slot);
+ }
+ }
+
/*
* ok, now that we have what we need, fetch the next tuple.
*/
- while (index_getnext_slot(scandesc, direction, slot))
+ while (skipped || index_getnext_slot(scandesc, direction, slot))
{
CHECK_FOR_INTERRUPTS();
+ skipped = false;
/*
* If the index was lossy, we have to recheck the index quals using
@@ -149,6 +193,7 @@ IndexNext(IndexScanState *node)
}
}
+ node->iss_FirstTupleEmitted = true;
return slot;
}
@@ -910,6 +955,8 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexScan;
+ indexstate->iss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->iss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 3432bb921d..fdc4ed0299 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -490,6 +490,7 @@ _copyIndexScan(const IndexScan *from)
COPY_NODE_FIELD(indexorderbyorig);
COPY_NODE_FIELD(indexorderbyops);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
@@ -515,6 +516,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 1ccd68d3aa..aec39e7ba0 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -559,6 +559,7 @@ _outIndexScan(StringInfo str, const IndexScan *node)
WRITE_NODE_FIELD(indexorderbyorig);
WRITE_NODE_FIELD(indexorderbyops);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
@@ -573,6 +574,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 764e3bb90c..0fc3c5ea68 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1787,6 +1787,7 @@ _readIndexScan(void)
READ_NODE_FIELD(indexorderbyorig);
READ_NODE_FIELD(indexorderbyops);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
@@ -1806,6 +1807,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 0ec9a6db76..d9a3343d50 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index aee81bd755..5b9a41ef10 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -175,12 +175,14 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
Oid indexid, List *indexqual, List *indexqualorig,
List *indexorderby, List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2908,7 +2910,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -2919,7 +2922,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexorderbys,
indexorderbys,
indexorderbyops,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
copy_generic_path_info(&scan_plan->plan, &best_path->path);
@@ -5182,7 +5186,8 @@ make_indexscan(List *qptlist,
List *indexorderby,
List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexScan *node = makeNode(IndexScan);
Plan *plan = &node->scan.plan;
@@ -5199,6 +5204,7 @@ make_indexscan(List *qptlist,
node->indexorderbyorig = indexorderbyorig;
node->indexorderbyops = indexorderbyops;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
@@ -5211,7 +5217,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5226,6 +5233,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 2507ec7d2a..53f9872943 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4829,6 +4829,82 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Consider index skip scan as well */
+ if (enable_indexskipscan &&
+ IsA(path, IndexPath) &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys != NIL)
+ {
+ ListCell *lc;
+ IndexOptInfo *index = NULL;
+ bool different_columns_order = false,
+ not_empty_qual = false;
+ int i = 0;
+ int distinctPrefixKeys;
+
+ Assert(path->pathtype == T_IndexOnlyScan ||
+ path->pathtype == T_IndexScan);
+
+ index = ((IndexPath *) path)->indexinfo;
+ distinctPrefixKeys = list_length(root->query_uniquekeys);
+
+ /*
+ * Normally we can think of distinctPrefixKeys as simply the
+ * number of distinct keys. But suppose we have a distinct key
+ * a, and the index contains b, a in exactly that order. In
+ * that situation we need to use the position of a in the
+ * index as distinctPrefixKeys, otherwise skipping would
+ * happen only on the first column.
+ */
+ foreach(lc, root->query_uniquekeys)
+ {
+ UniqueKey *uniquekey = (UniqueKey *) lfirst(lc);
+ EquivalenceMember *em =
+ lfirst_node(EquivalenceMember,
+ list_head(uniquekey->eq_clause->ec_members));
+ Var *var = (Var *) em->em_expr;
+
+ Assert(i < index->ncolumns);
+
+ for (i = 0; i < index->ncolumns; i++)
+ {
+ if (index->indexkeys[i] == var->varattno)
+ {
+ distinctPrefixKeys = Max(i + 1, distinctPrefixKeys);
+ break;
+ }
+ }
+ }
+
+ /*
+ * XXX: In the case of an index scan, quals evaluation happens
+ * after ExecScanFetch, which means skip results could be
+ * filtered out. Consider the following query:
+ *
+ * SELECT DISTINCT ON (a, b) a, b, c FROM t WHERE c < 100;
+ *
+ * Skip scan returns one tuple for each distinct set of (a,
+ * b) with an arbitrary c, so if the chosen c does not match
+ * the qual while some other c does, we miss that tuple.
+ */
+ if (path->pathtype == T_IndexScan &&
+ parse->jointree != NULL &&
+ parse->jointree->quals != NULL &&
+ list_length((List *) parse->jointree->quals) != 0)
+ not_empty_qual = true;
+
+ if (!different_columns_order && !not_empty_qual)
+ {
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ distinctPrefixKeys,
+ numDistinctRows));
+ }
+ }
}
}
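
The prefix-size adjustment in create_distinct_paths above can be illustrated in isolation: for each distinct key attribute, find its position among the index key columns and widen the skip prefix to cover it. A hedged sketch using plain arrays instead of the planner's List/UniqueKey machinery (function and parameter names are invented for the example):

```c
#include <assert.h>

/*
 * For each distinct key attribute, locate it among the index key columns
 * and widen the skip prefix so it covers that position, mirroring the
 * distinctPrefixKeys loop in the patch.
 */
static int
compute_prefix(const int *indexkeys, int ncolumns,
               const int *distinctkeys, int ndistinct)
{
    int prefix = ndistinct;

    for (int k = 0; k < ndistinct; k++)
        for (int i = 0; i < ncolumns; i++)
            if (indexkeys[i] == distinctkeys[k])
            {
                if (i + 1 > prefix)
                    prefix = i + 1;
                break;
            }
    return prefix;
}
```

With an index on (b, a) and DISTINCT on a alone, the prefix must be 2, not 1; otherwise skipping would only ever advance on b.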
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index f268112386..398c7b1a59 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2916,6 +2916,46 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *basepath,
+ int distinctPrefixKeys,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(distinctPrefixKeys > 0);
+ /*Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));*/
+ pathnode->indexskipprefix = distinctPrefixKeys;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = basepath->startup_cost;
+ pathnode->path.total_cost = basepath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
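
The cost model in create_skipscan_unique_path is deliberately simple: startup cost is unchanged, and total cost is the startup cost multiplied by the expected number of distinct groups. A toy check of that arithmetic (the struct and values are made up; the real code re-costs an IndexPath):

```c
#include <assert.h>

typedef struct
{
    double startup_cost;
    double total_cost;
    double rows;
} ToyPath;

/* Re-cost a path the way create_skipscan_unique_path() does. */
static ToyPath
skipscan_cost(ToyPath base, double numGroups)
{
    ToyPath p = base;

    p.startup_cost = base.startup_cost;
    p.total_cost = base.startup_cost * numGroups;
    p.rows = numGroups;
    return p;
}
```

For the example in the cover letter (startup 0.43, three distinct values), this yields a total cost near 1.29, which matches the order of magnitude of the 0.43..1.30 estimate shown in the EXPLAIN output.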
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index e5f9e04d65..c86ec0835c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -272,6 +272,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 994bf37477..054cd44be4 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -916,6 +916,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index be02a76d9d..cab198feb9 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6e3db06eed..f84791e358 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -130,6 +130,13 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir,
+ ScanDirection indexdir,
+ bool start,
+ int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -225,6 +232,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index a813b004be..d33e995a73 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -180,6 +180,8 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool start, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..cf7a24444d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -662,6 +662,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTScanInsert skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -776,6 +779,8 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool start, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -800,6 +805,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool start, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/access/sdir.h b/src/include/access/sdir.h
index 664e72ef5d..dff90fada1 100644
--- a/src/include/access/sdir.h
+++ b/src/include/access/sdir.h
@@ -55,4 +55,11 @@ typedef enum ScanDirection
#define ScanDirectionIsForward(direction) \
((bool) ((direction) == ForwardScanDirection))
+/*
+ * ScanDirectionsAreOpposite
+ * True iff scan directions are backward/forward or forward/backward.
+ */
+#define ScanDirectionsAreOpposite(dirA, dirB) \
+ ((bool) ((dirA) != NoMovementScanDirection && (dirA) == -(dirB)))
+
#endif /* SDIR_H */
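
The new macro relies on ScanDirection's enum values being -1/0/+1, so negating one direction yields its opposite. A standalone check of that property (enum values copied from sdir.h, macro arguments parenthesized defensively):

```c
#include <assert.h>
#include <stdbool.h>

/* Values as defined in src/include/access/sdir.h. */
typedef enum ScanDirection
{
    BackwardScanDirection = -1,
    NoMovementScanDirection = 0,
    ForwardScanDirection = 1
} ScanDirection;

/*
 * True iff the two directions are forward/backward or backward/forward;
 * NoMovement is never opposite to anything, including itself.
 */
#define ScanDirectionsAreOpposite(dirA, dirB) \
    ((bool) ((dirA) != NoMovementScanDirection && (dirA) == -(dirB)))
```

Since -NoMovementScanDirection == NoMovementScanDirection, the explicit NoMovement check is what keeps (NoMovement, NoMovement) from counting as opposite.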
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 44f76082e9..9e6d501ad1 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1428,6 +1428,8 @@ typedef struct IndexScanState
ExprContext *iss_RuntimeContext;
Relation iss_RelationDesc;
struct IndexScanDescData *iss_ScanDesc;
+ int iss_SkipPrefixSize;
+ bool iss_FirstTupleEmitted;
/* These are needed for re-checking ORDER BY expr ordering */
pairingheap *iss_ReorderQueue;
@@ -1457,6 +1459,8 @@ typedef struct IndexScanState
* TableSlot slot for holding tuples fetched from the table
* VMBuffer buffer in use for visibility map testing, if any
* PscanLen size of parallel index-only scan descriptor
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ----------------
*/
typedef struct IndexOnlyScanState
@@ -1475,6 +1479,8 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
TupleTableSlot *ioss_TableSlot;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 6198c31cd4..d5d238e1aa 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -837,6 +837,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1187,6 +1188,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1199,6 +1203,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 8e6594e355..f09c8c43a3 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -405,6 +405,8 @@ typedef struct IndexScan
List *indexorderbyorig; /* the same in original form */
List *indexorderbyops; /* OIDs of sort ops for ORDER BY exprs */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct
+ * scans */
} IndexScan;
/* ----------------
@@ -432,6 +434,8 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct
+ * scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b3d0b4f6fb..9abfdfb6bd 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 37a946f857..09d61a8e99 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -201,6 +201,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..51e12ac925 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,508 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index only skip scan
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, tenthous, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) tenthous
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+SELECT DISTINCT a FROM distinct_a;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+SELECT DISTINCT a FROM distinct_a WHERE a = 1;
+ a
+---
+ 1
+(1 row)
+
+SELECT DISTINCT a FROM distinct_a ORDER BY a DESC;
+ a
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a;
+ QUERY PLAN
+--------------------------------------------------------
+ Index Only Scan using distinct_a_a_b_idx on distinct_a
+ Skip scan: true
+(2 rows)
+
+-- test index skip scan with a condition on a non-unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+ a | b
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+-- check columns order
+CREATE INDEX distinct_a_b_a on distinct_a (b, a);
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+ a | b
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+ QUERY PLAN
+----------------------------------------------------
+ Index Only Scan using distinct_a_b_a on distinct_a
+ Skip scan: true
+ Index Cond: (b = 2)
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+ QUERY PLAN
+----------------------------------------------------
+ Index Only Scan using distinct_a_b_a on distinct_a
+ Skip scan: true
+ Index Cond: (b = 2)
+(3 rows)
+
+DROP INDEX distinct_a_b_a;
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+FETCH FROM c;
+ a | b
+---+---
+ 1 | 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+END;
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+FETCH FROM c;
+ a | b
+---+-------
+ 5 | 10000
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+END;
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+ QUERY PLAN
+--------------------------------------------------------------
+ Index Only Scan using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan: true
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 1 | 2
+ 3 | 1 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 1 | 2
+ 1 | 1 | 2
+(2 rows)
+
+END;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Index Only Scan Backward using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan: true
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 2 | 2
+ 1 | 2 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 2 | 2
+ 3 | 2 | 2
+(2 rows)
+
+END;
+DROP TABLE distinct_abc;
+-- index skip scan
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+ a | b | c
+---+---+----
+ 1 | 1 | 10
+ 2 | 1 | 10
+ 3 | 1 | 10
+ 4 | 1 | 10
+ 5 | 1 | 10
+(5 rows)
+
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+ a | b | c
+---+---+----
+ 1 | 1 | 10
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+ QUERY PLAN
+---------------------------------------------------
+ Index Scan using distinct_a_a_b_idx on distinct_a
+ Skip scan: true
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+ QUERY PLAN
+-----------------------------------------------------
+ Unique
+ -> Bitmap Heap Scan on distinct_a
+ Recheck Cond: (a = 1)
+ -> Bitmap Index Scan on distinct_a_a_b_idx
+ Index Cond: (a = 1)
+(5 rows)
+
+-- check columns order
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+ QUERY PLAN
+---------------------------------------------------------
+ Unique
+ -> Index Scan using distinct_a_a_b_idx on distinct_a
+ Index Cond: (b = 2)
+ Filter: (c = 10)
+(4 rows)
+
+-- check projection case
+SELECT DISTINCT a, a FROM distinct_a WHERE b = 2;
+ a | a
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 3
+ 4 | 4
+ 5 | 5
+(5 rows)
+
+SELECT DISTINCT a, 1 FROM distinct_a WHERE b = 2;
+ a | ?column?
+---+----------
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT a FROM distinct_a;
+FETCH FROM c;
+ a
+---
+ 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a
+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+END;
+DROP TABLE distinct_a;
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 9999
+ 1 | 10000
+(5 rows)
+
+DROP TABLE distinct_visibility;
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+ SELECT a, b::int2 b, (b % 2)::int2 c FROM
+ generate_series(1, 5) a,
+ generate_series(1,366) b;
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Index Only Scan using distinct_boundaries_a_b_c_idx on distinct_boundaries
+ Skip scan: true
+ Index Cond: ((b >= 1) AND (c = 0))
+(3 rows)
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ a | b | c
+---+---+---
+ 1 | 2 | 0
+ 2 | 2 | 0
+ 3 | 2 | 0
+ 4 | 2 | 0
+ 5 | 2 | 0
+(5 rows)
+
+DROP TABLE distinct_boundaries;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..4c8a50d153 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,189 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index only skip scan
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, tenthous, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) tenthous
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+
+SELECT DISTINCT a FROM distinct_a;
+SELECT DISTINCT a FROM distinct_a WHERE a = 1;
+SELECT DISTINCT a FROM distinct_a ORDER BY a DESC;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a;
+
+-- test index skip scan with a condition on a non-unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+
+-- check columns order
+CREATE INDEX distinct_a_b_a on distinct_a (b, a);
+
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+
+DROP INDEX distinct_a_b_a;
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+DROP TABLE distinct_abc;
+
+-- index skip scan
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+
+-- check columns order
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+
+-- check projection case
+SELECT DISTINCT a, a FROM distinct_a WHERE b = 2;
+SELECT DISTINCT a, 1 FROM distinct_a WHERE b = 2;
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT a FROM distinct_a;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+DROP TABLE distinct_a;
+
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DROP TABLE distinct_visibility;
+
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+ SELECT a, b::int2 b, (b % 2)::int2 c FROM
+ generate_series(1, 5) a,
+ generate_series(1,366) b;
+
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+DROP TABLE distinct_boundaries;
--
2.21.0
Hi,
On 11/11/19 1:24 PM, Jesper Pedersen wrote:
v29 using UniqueKey attached.
Just a small update to the UniqueKey patch to hopefully keep CFbot happy.
Feedback, especially on the planner changes, would be greatly appreciated.
Best regards,
Jesper
Attachments:
v30_0001-Unique-key.patch (text/x-patch)
From b1a69c2791c8aba6caa85d7f24b9836641150875 Mon Sep 17 00:00:00 2001
From: jesperpedersen <jesper.pedersen@redhat.com>
Date: Tue, 9 Jul 2019 06:44:57 -0400
Subject: [PATCH] Unique key
Design by David Rowley.
Author: Jesper Pedersen
---
src/backend/nodes/outfuncs.c | 14 +++
src/backend/nodes/print.c | 39 +++++++
src/backend/optimizer/path/Makefile | 3 +-
src/backend/optimizer/path/allpaths.c | 8 ++
src/backend/optimizer/path/indxpath.c | 41 +++++++
src/backend/optimizer/path/pathkeys.c | 71 ++++++++++--
src/backend/optimizer/path/uniquekey.c | 147 +++++++++++++++++++++++++
src/backend/optimizer/plan/planagg.c | 1 +
src/backend/optimizer/plan/planmain.c | 1 +
src/backend/optimizer/plan/planner.c | 17 ++-
src/backend/optimizer/util/pathnode.c | 12 ++
src/include/nodes/nodes.h | 1 +
src/include/nodes/pathnodes.h | 18 +++
src/include/nodes/print.h | 1 +
src/include/optimizer/pathnode.h | 1 +
src/include/optimizer/paths.h | 11 ++
16 files changed, 373 insertions(+), 13 deletions(-)
create mode 100644 src/backend/optimizer/path/uniquekey.c
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b0dcd02ff6..1ccd68d3aa 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1720,6 +1720,7 @@ _outPathInfo(StringInfo str, const Path *node)
WRITE_FLOAT_FIELD(startup_cost, "%.2f");
WRITE_FLOAT_FIELD(total_cost, "%.2f");
WRITE_NODE_FIELD(pathkeys);
+ WRITE_NODE_FIELD(uniquekeys);
}
/*
@@ -2201,6 +2202,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
WRITE_NODE_FIELD(eq_classes);
WRITE_BOOL_FIELD(ec_merging_done);
WRITE_NODE_FIELD(canon_pathkeys);
+ WRITE_NODE_FIELD(canon_uniquekeys);
WRITE_NODE_FIELD(left_join_clauses);
WRITE_NODE_FIELD(right_join_clauses);
WRITE_NODE_FIELD(full_join_clauses);
@@ -2210,6 +2212,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
WRITE_NODE_FIELD(placeholder_list);
WRITE_NODE_FIELD(fkey_list);
WRITE_NODE_FIELD(query_pathkeys);
+ WRITE_NODE_FIELD(query_uniquekeys);
WRITE_NODE_FIELD(group_pathkeys);
WRITE_NODE_FIELD(window_pathkeys);
WRITE_NODE_FIELD(distinct_pathkeys);
@@ -2397,6 +2400,14 @@ _outPathKey(StringInfo str, const PathKey *node)
WRITE_BOOL_FIELD(pk_nulls_first);
}
+static void
+_outUniqueKey(StringInfo str, const UniqueKey *node)
+{
+ WRITE_NODE_TYPE("UNIQUEKEY");
+
+ WRITE_NODE_FIELD(eq_clause);
+}
+
static void
_outPathTarget(StringInfo str, const PathTarget *node)
{
@@ -4083,6 +4094,9 @@ outNode(StringInfo str, const void *obj)
case T_PathKey:
_outPathKey(str, obj);
break;
+ case T_UniqueKey:
+ _outUniqueKey(str, obj);
+ break;
case T_PathTarget:
_outPathTarget(str, obj);
break;
diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index 4ecde6b421..435b32063c 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable)
printf(")\n");
}
+/*
+ * print_uniquekeys -
+ * print a readable form of a list of UniqueKeys
+ */
+void
+print_uniquekeys(const List *uniquekeys, const List *rtable)
+{
+ ListCell *l;
+
+ printf("(");
+ foreach(l, uniquekeys)
+ {
+ UniqueKey *unique_key = (UniqueKey *) lfirst(l);
+ EquivalenceClass *eclass = (EquivalenceClass *) unique_key->eq_clause;
+ ListCell *k;
+ bool first = true;
+
+ /* chase up */
+ while (eclass->ec_merged)
+ eclass = eclass->ec_merged;
+
+ printf("(");
+ foreach(k, eclass->ec_members)
+ {
+ EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
+
+ if (first)
+ first = false;
+ else
+ printf(", ");
+ print_expr((Node *) mem->em_expr, rtable);
+ }
+ printf(")");
+ if (lnext(uniquekeys, l))
+ printf(", ");
+ }
+ printf(")\n");
+}
+
/*
* print_tl
* print targetlist in a more legible way.
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..63cc1505d9 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
joinpath.o \
joinrels.o \
pathkeys.o \
- tidpath.o
+ tidpath.o \
+ uniquekey.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index db3a68a51d..5fc9b81746 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3954,6 +3954,14 @@ print_path(PlannerInfo *root, Path *path, int indent)
print_pathkeys(path->pathkeys, root->parse->rtable);
}
+ if (path->uniquekeys)
+ {
+ for (i = 0; i < indent; i++)
+ printf("\t");
+ printf(" uniquekeys: ");
+ print_uniquekeys(path->uniquekeys, root->parse->rtable);
+ }
+
if (join)
{
JoinPath *jp = (JoinPath *) path;
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 37b257cd0e..88c1dd0f59 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -189,6 +189,7 @@ static Expr *match_clause_to_ordering_op(IndexOptInfo *index,
static bool ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
+static List *get_uniquekeys_for_index(PlannerInfo *root, List *pathkeys);
/*
@@ -874,6 +875,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
List *orderbyclausecols;
List *index_pathkeys;
List *useful_pathkeys;
+ List *useful_uniquekeys = NIL;
bool found_lower_saop_clause;
bool pathkeys_possibly_useful;
bool index_is_ordered;
@@ -1036,11 +1038,15 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
if (index_clauses != NIL || useful_pathkeys != NIL || useful_predicate ||
index_only_scan)
{
+ if (has_useful_uniquekeys(root))
+ useful_uniquekeys = get_uniquekeys_for_index(root, useful_pathkeys);
+
ipath = create_index_path(root, index,
index_clauses,
orderbyclauses,
orderbyclausecols,
useful_pathkeys,
+ useful_uniquekeys,
index_is_ordered ?
ForwardScanDirection :
NoMovementScanDirection,
@@ -1063,6 +1069,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
orderbyclauses,
orderbyclausecols,
useful_pathkeys,
+ useful_uniquekeys,
index_is_ordered ?
ForwardScanDirection :
NoMovementScanDirection,
@@ -1093,11 +1100,15 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
index_pathkeys);
if (useful_pathkeys != NIL)
{
+ if (has_useful_uniquekeys(root))
+ useful_uniquekeys = get_uniquekeys_for_index(root, useful_pathkeys);
+
ipath = create_index_path(root, index,
index_clauses,
NIL,
NIL,
useful_pathkeys,
+ useful_uniquekeys,
BackwardScanDirection,
index_only_scan,
outer_relids,
@@ -1115,6 +1126,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
NIL,
NIL,
useful_pathkeys,
+ useful_uniquekeys,
BackwardScanDirection,
index_only_scan,
outer_relids,
@@ -3365,6 +3377,35 @@ match_clause_to_ordering_op(IndexOptInfo *index,
return clause;
}
+/*
+ * get_uniquekeys_for_index
+ */
+static List *
+get_uniquekeys_for_index(PlannerInfo *root, List *pathkeys)
+{
+ ListCell *lc;
+
+ if (pathkeys)
+ {
+ List *uniquekeys = NIL;
+ foreach(lc, pathkeys)
+ {
+ UniqueKey *unique_key;
+ PathKey *pk = (PathKey *) lfirst(lc);
+ EquivalenceClass *ec = (EquivalenceClass *) pk->pk_eclass;
+
+ unique_key = makeNode(UniqueKey);
+ unique_key->eq_clause = ec;
+
+ uniquekeys = lappend(uniquekeys, unique_key);
+ }
+
+ if (uniquekeys_contained_in(root->canon_uniquekeys, uniquekeys))
+ return uniquekeys;
+ }
+
+ return NIL;
+}
/****************************************************************************
* ---- ROUTINES TO DO PARTIAL INDEX PREDICATE TESTS ----
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 2f4fea241a..b0dc1bc22a 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
#include "utils/lsyscache.h"
+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
RelOptInfo *partrel,
@@ -96,6 +97,29 @@ make_canonical_pathkey(PlannerInfo *root,
return pk;
}
+/*
+ * pathkey_is_unique
+ * Checks if the new pathkey's equivalence class is the same as that of
+ * any existing member of the pathkey list.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+ EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+ ListCell *lc;
+
+ /* If the same EC is already in the list, then not unique */
+ foreach(lc, pathkeys)
+ {
+ PathKey *old_pathkey = (PathKey *) lfirst(lc);
+
+ if (new_ec == old_pathkey->pk_eclass)
+ return false;
+ }
+
+ return true;
+}
+
/*
* pathkey_is_redundant
* Is a pathkey redundant with one already in the given list?
@@ -135,22 +159,12 @@ static bool
pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys)
{
EquivalenceClass *new_ec = new_pathkey->pk_eclass;
- ListCell *lc;
/* Check for EC containing a constant --- unconditionally redundant */
if (EC_MUST_BE_REDUNDANT(new_ec))
return true;
- /* If same EC already used in list, then redundant */
- foreach(lc, pathkeys)
- {
- PathKey *old_pathkey = (PathKey *) lfirst(lc);
-
- if (new_ec == old_pathkey->pk_eclass)
- return true;
- }
-
- return false;
+ return !pathkey_is_unique(new_pathkey, pathkeys);
}
/*
@@ -1098,6 +1112,41 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
return pathkeys;
}
+/*
+ * make_pathkeys_for_uniquekeyclauses
+ * Generate a pathkeys list to be used for uniquekey clauses
+ */
+List *
+make_pathkeys_for_uniquekeys(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist)
+{
+ List *pathkeys = NIL;
+ ListCell *l;
+
+ foreach(l, sortclauses)
+ {
+ SortGroupClause *sortcl = (SortGroupClause *) lfirst(l);
+ Expr *sortkey;
+ PathKey *pathkey;
+
+ sortkey = (Expr *) get_sortgroupclause_expr(sortcl, tlist);
+ Assert(OidIsValid(sortcl->sortop));
+ pathkey = make_pathkey_from_sortop(root,
+ sortkey,
+ root->nullable_baserels,
+ sortcl->sortop,
+ sortcl->nulls_first,
+ sortcl->tleSortGroupRef,
+ true);
+
+ if (pathkey_is_unique(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+
+ return pathkeys;
+}
+
/****************************************************************************
* PATHKEYS AND MERGECLAUSES
****************************************************************************/
diff --git a/src/backend/optimizer/path/uniquekey.c b/src/backend/optimizer/path/uniquekey.c
new file mode 100644
index 0000000000..13d4ebb98c
--- /dev/null
+++ b/src/backend/optimizer/path/uniquekey.c
@@ -0,0 +1,147 @@
+/*-------------------------------------------------------------------------
+ *
+ * uniquekey.c
+ * Utilities for matching and building unique keys
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/optimizer/path/uniquekey.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "nodes/pg_list.h"
+
+static UniqueKey *make_canonical_uniquekey(PlannerInfo *root, EquivalenceClass *eclass);
+
+/*
+ * Build a list of unique keys
+ */
+List*
+build_uniquekeys(PlannerInfo *root, List *sortclauses)
+{
+ List *result = NIL;
+ List *sortkeys;
+ ListCell *l;
+
+ sortkeys = make_pathkeys_for_uniquekeys(root,
+ sortclauses,
+ root->processed_tlist);
+
+ /* Create a uniquekey and add it to the list */
+ foreach(l, sortkeys)
+ {
+ PathKey *pathkey = (PathKey *) lfirst(l);
+ EquivalenceClass *ec = pathkey->pk_eclass;
+ UniqueKey *unique_key = make_canonical_uniquekey(root, ec);
+
+ result = lappend(result, unique_key);
+ }
+
+ return result;
+}
+
+/*
+ * uniquekeys_contained_in
+ * Check whether every key in keys2 is also present in keys1
+ */
+bool
+uniquekeys_contained_in(List *keys1, List *keys2)
+{
+ ListCell *key1,
+ *key2;
+
+ /*
+ * Fall out quickly if we are passed two identical lists. This mostly
+ * catches the case where both are NIL, but that's common enough to
+ * warrant the test.
+ */
+ if (keys1 == keys2)
+ return true;
+
+ foreach(key2, keys2)
+ {
+ bool found = false;
+ UniqueKey *uniquekey2 = (UniqueKey *) lfirst(key2);
+
+ foreach(key1, keys1)
+ {
+ UniqueKey *uniquekey1 = (UniqueKey *) lfirst(key1);
+
+ if (uniquekey1->eq_clause == uniquekey2->eq_clause)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * has_useful_uniquekeys
+ * Detect whether the planner could have any uniquekeys that are
+ * useful.
+ */
+bool
+has_useful_uniquekeys(PlannerInfo *root)
+{
+ if (root->query_uniquekeys != NIL)
+ return true; /* there are some */
+ return false; /* definitely useless */
+}
+
+/*
+ * make_canonical_uniquekey
+ * Given the parameters for a UniqueKey, find any pre-existing matching
+ * uniquekey in the query's list of "canonical" uniquekeys. Make a new
+ * entry if there's not one already.
+ *
+ * Note that this function must not be used until after we have completed
+ * merging EquivalenceClasses. (We don't try to enforce that here; instead,
+ * equivclass.c will complain if a merge occurs after root->canon_uniquekeys
+ * has become nonempty.)
+ */
+static UniqueKey *
+make_canonical_uniquekey(PlannerInfo *root,
+ EquivalenceClass *eclass)
+{
+ UniqueKey *uk;
+ ListCell *lc;
+ MemoryContext oldcontext;
+
+ /* The passed eclass might be non-canonical, so chase up to the top */
+ while (eclass->ec_merged)
+ eclass = eclass->ec_merged;
+
+ foreach(lc, root->canon_uniquekeys)
+ {
+ uk = (UniqueKey *) lfirst(lc);
+ if (eclass == uk->eq_clause)
+ return uk;
+ }
+
+ /*
+ * Be sure canonical uniquekeys are allocated in the main planning context.
+ * Not an issue in normal planning, but it is for GEQO.
+ */
+ oldcontext = MemoryContextSwitchTo(root->planner_cxt);
+
+ uk = makeNode(UniqueKey);
+ uk->eq_clause = eclass;
+
+ root->canon_uniquekeys = lappend(root->canon_uniquekeys, uk);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return uk;
+}
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 974f6204ca..f6144b2267 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -511,6 +511,7 @@ minmax_qp_callback(PlannerInfo *root, void *extra)
root->parse->targetList);
root->query_pathkeys = root->sort_pathkeys;
+ root->query_uniquekeys = NIL;
}
/*
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
index f0c1b52a2e..3ccde03ab7 100644
--- a/src/backend/optimizer/plan/planmain.c
+++ b/src/backend/optimizer/plan/planmain.c
@@ -70,6 +70,7 @@ query_planner(PlannerInfo *root,
root->join_rel_level = NULL;
root->join_cur_level = 0;
root->canon_pathkeys = NIL;
+ root->canon_uniquekeys = NIL;
root->left_join_clauses = NIL;
root->right_join_clauses = NIL;
root->full_join_clauses = NIL;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7fe11b59a0..8f03a20825 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3651,15 +3651,30 @@ standard_qp_callback(PlannerInfo *root, void *extra)
* much easier, since we know that the parser ensured that one is a
* superset of the other.
*/
+ root->query_uniquekeys = NIL;
+
if (root->group_pathkeys)
+ {
root->query_pathkeys = root->group_pathkeys;
+
+ if (!root->parse->hasAggs)
+ root->query_uniquekeys = build_uniquekeys(root, qp_extra->groupClause);
+ }
else if (root->window_pathkeys)
root->query_pathkeys = root->window_pathkeys;
else if (list_length(root->distinct_pathkeys) >
list_length(root->sort_pathkeys))
+ {
root->query_pathkeys = root->distinct_pathkeys;
+ root->query_uniquekeys = build_uniquekeys(root, parse->distinctClause);
+ }
else if (root->sort_pathkeys)
+ {
root->query_pathkeys = root->sort_pathkeys;
+
+ if (root->distinct_pathkeys)
+ root->query_uniquekeys = build_uniquekeys(root, parse->distinctClause);
+ }
else
root->query_pathkeys = NIL;
}
@@ -6216,7 +6231,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
/* Estimate the cost of index scan */
indexScanPath = create_index_path(root, indexInfo,
- NIL, NIL, NIL, NIL,
+ NIL, NIL, NIL, NIL, NIL,
ForwardScanDirection, false,
NULL, 1.0, false);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 60c93ee7c5..ec02c468d0 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -940,6 +940,7 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = parallel_workers;
pathnode->pathkeys = NIL; /* seqscan has unordered result */
+ pathnode->uniquekeys = NIL;
cost_seqscan(pathnode, root, rel, pathnode->param_info);
@@ -964,6 +965,7 @@ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* samplescan has unordered result */
+ pathnode->uniquekeys = NIL;
cost_samplescan(pathnode, root, rel, pathnode->param_info);
@@ -1000,6 +1002,7 @@ create_index_path(PlannerInfo *root,
List *indexorderbys,
List *indexorderbycols,
List *pathkeys,
+ List *uniquekeys,
ScanDirection indexscandir,
bool indexonly,
Relids required_outer,
@@ -1018,6 +1021,7 @@ create_index_path(PlannerInfo *root,
pathnode->path.parallel_safe = rel->consider_parallel;
pathnode->path.parallel_workers = 0;
pathnode->path.pathkeys = pathkeys;
+ pathnode->path.uniquekeys = uniquekeys;
pathnode->indexinfo = index;
pathnode->indexclauses = indexclauses;
@@ -1061,6 +1065,7 @@ create_bitmap_heap_path(PlannerInfo *root,
pathnode->path.parallel_safe = rel->consider_parallel;
pathnode->path.parallel_workers = parallel_degree;
pathnode->path.pathkeys = NIL; /* always unordered */
+ pathnode->path.uniquekeys = NIL;
pathnode->bitmapqual = bitmapqual;
@@ -1922,6 +1927,7 @@ create_functionscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = pathkeys;
+ pathnode->uniquekeys = NIL;
cost_functionscan(pathnode, root, rel, pathnode->param_info);
@@ -1948,6 +1954,7 @@ create_tablefuncscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->uniquekeys = NIL;
cost_tablefuncscan(pathnode, root, rel, pathnode->param_info);
@@ -1974,6 +1981,7 @@ create_valuesscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->uniquekeys = NIL;
cost_valuesscan(pathnode, root, rel, pathnode->param_info);
@@ -1999,6 +2007,7 @@ create_ctescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* XXX for now, result is always unordered */
+ pathnode->uniquekeys = NIL;
cost_ctescan(pathnode, root, rel, pathnode->param_info);
@@ -2025,6 +2034,7 @@ create_namedtuplestorescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->uniquekeys = NIL;
cost_namedtuplestorescan(pathnode, root, rel, pathnode->param_info);
@@ -2051,6 +2061,7 @@ create_resultscan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->uniquekeys = NIL;
cost_resultscan(pathnode, root, rel, pathnode->param_info);
@@ -2077,6 +2088,7 @@ create_worktablescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->parallel_safe = rel->consider_parallel;
pathnode->parallel_workers = 0;
pathnode->pathkeys = NIL; /* result is always unordered */
+ pathnode->uniquekeys = NIL;
/* Cost is the same as for a regular CTE scan */
cost_ctescan(pathnode, root, rel, pathnode->param_info);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index bce2d59b0d..cbb6ba2586 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -261,6 +261,7 @@ typedef enum NodeTag
T_EquivalenceMember,
T_PathKey,
T_PathTarget,
+ T_UniqueKey,
T_RestrictInfo,
T_IndexClause,
T_PlaceHolderVar,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 23a06d718e..10ece6c875 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -267,6 +267,8 @@ struct PlannerInfo
List *canon_pathkeys; /* list of "canonical" PathKeys */
+ List *canon_uniquekeys; /* list of "canonical" UniqueKeys */
+
List *left_join_clauses; /* list of RestrictInfos for mergejoinable
* outer join clauses w/nonnullable var on
* left */
@@ -295,6 +297,8 @@ struct PlannerInfo
List *query_pathkeys; /* desired pathkeys for query_planner() */
+ List *query_uniquekeys; /* unique keys used for the query */
+
List *group_pathkeys; /* groupClause pathkeys, if any */
List *window_pathkeys; /* pathkeys of bottom window, if any */
List *distinct_pathkeys; /* distinctClause pathkeys, if any */
@@ -1075,6 +1079,15 @@ typedef struct ParamPathInfo
List *ppi_clauses; /* join clauses available from outer rels */
} ParamPathInfo;
+/*
+ * UniqueKey
+ */
+typedef struct UniqueKey
+{
+ NodeTag type;
+
+ EquivalenceClass *eq_clause; /* equivalence class */
+} UniqueKey;
/*
* Type "Path" is used as-is for sequential-scan paths, as well as some other
@@ -1104,6 +1117,9 @@ typedef struct ParamPathInfo
*
* "pathkeys" is a List of PathKey nodes (see above), describing the sort
* ordering of the path's output rows.
+ *
+ * "uniquekeys", if not NIL, is a list of UniqueKey nodes (see above),
+ * describing aspects of the path's output rows that are known unique.
*/
typedef struct Path
{
@@ -1127,6 +1143,8 @@ typedef struct Path
List *pathkeys; /* sort ordering of path's output */
/* pathkeys is a List of PathKey nodes; see above */
+
+ List *uniquekeys; /* the unique keys, or NIL if none */
} Path;
/* Macro for extracting a path's parameterization relids; beware double eval */
diff --git a/src/include/nodes/print.h b/src/include/nodes/print.h
index cbff56a724..f1a7112877 100644
--- a/src/include/nodes/print.h
+++ b/src/include/nodes/print.h
@@ -28,6 +28,7 @@ extern char *pretty_format_node_dump(const char *dump);
extern void print_rt(const List *rtable);
extern void print_expr(const Node *expr, const List *rtable);
extern void print_pathkeys(const List *pathkeys, const List *rtable);
+extern void print_uniquekeys(const List *uniquekeys, const List *rtable);
extern void print_tl(const List *tlist, const List *rtable);
extern void print_slot(TupleTableSlot *slot);
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index a12af54971..37a946f857 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -44,6 +44,7 @@ extern IndexPath *create_index_path(PlannerInfo *root,
List *indexorderbys,
List *indexorderbycols,
List *pathkeys,
+ List *uniquekeys,
ScanDirection indexscandir,
bool indexonly,
Relids required_outer,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index c6c34630c2..c79e47eeaf 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -214,6 +214,9 @@ extern List *build_join_pathkeys(PlannerInfo *root,
extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
List *sortclauses,
List *tlist);
+extern List *make_pathkeys_for_uniquekeys(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist);
extern void initialize_mergeclause_eclasses(PlannerInfo *root,
RestrictInfo *restrictinfo);
extern void update_mergeclause_eclasses(PlannerInfo *root,
@@ -240,4 +243,12 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *live_childrels);
+/*
+ * uniquekey.c
+ * Utilities for matching and building unique keys
+ */
+extern List *build_uniquekeys(PlannerInfo *root, List *sortclauses);
+extern bool uniquekeys_contained_in(List *keys1, List *keys2);
+extern bool has_useful_uniquekeys(PlannerInfo *root);
+
#endif /* PATHS_H */
--
2.21.0
v30_0002-Index-skip-scan.patch (text/x-patch)
From 513ed9b00e30fec85a6e7edf122bbc3e7e124a5e Mon Sep 17 00:00:00 2001
From: jesperpedersen <jesper.pedersen@redhat.com>
Date: Fri, 15 Nov 2019 09:46:53 -0500
Subject: [PATCH 2/2] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan and IndexScan. To make it suitable for both
situations when there are small number of distinct values and
significant amount of distinct values the following approach is taken -
instead of searching from the root for every value we're searching for
then first on the current page, and then if not found continue searching
from the root.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
Author: Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Floris Van Nee, Kyotaro Horiguchi, Tomas Vondra, Peter Geoghegan
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 15 +
doc/src/sgml/indexam.sgml | 63 +++
doc/src/sgml/indices.sgml | 23 +
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 18 +
src/backend/access/nbtree/nbtree.c | 13 +
src/backend/access/nbtree/nbtsearch.c | 363 +++++++++++++
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 25 +
src/backend/executor/nodeIndexonlyscan.c | 51 +-
src/backend/executor/nodeIndexscan.c | 51 +-
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 2 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/createplan.c | 20 +-
src/backend/optimizer/plan/planner.c | 76 +++
src/backend/optimizer/util/pathnode.c | 40 ++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 8 +
src/include/access/genam.h | 2 +
src/include/access/nbtree.h | 7 +
src/include/access/sdir.h | 7 +
src/include/nodes/execnodes.h | 6 +
src/include/nodes/pathnodes.h | 5 +
src/include/nodes/plannodes.h | 4 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 5 +
src/test/regress/expected/select_distinct.out | 505 ++++++++++++++++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/select_distinct.sql | 186 +++++++
37 files changed, 1510 insertions(+), 11 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index e2063bac62..bc3cf8e7fe 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f83770350e..b3a96af1f5 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4478,6 +4478,21 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). The default is
+ <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index dd54c68802..94e09835b4 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -144,6 +144,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -687,6 +688,68 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan,
+ ScanDirection direction,
+ ScanDirection indexdir,
+ bool scanstart,
+ int prefix);
+</programlisting>
+ Skip past all tuples where the first <parameter>prefix</parameter> columns
+ have the same value as the last tuple returned in the current scan. The
+ arguments are:
+
+ <variablelist>
+ <varlistentry>
+ <term><parameter>scan</parameter></term>
+ <listitem>
+ <para>
+ The index scan descriptor.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>direction</parameter></term>
+ <listitem>
+ <para>
+ The direction in which the scan is advancing.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>indexdir</parameter></term>
+ <listitem>
+ <para>
+ The overall direction of the index scan, in which the data must be
+ read; this may differ from <parameter>direction</parameter> when a
+ cursor is fetched in the opposite direction.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>scanstart</parameter></term>
+ <listitem>
+ <para>
+ Whether this is the start of the scan.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>prefix</parameter></term>
+ <listitem>
+ <para>
+ The size of the distinct prefix, i.e. the number of leading index
+ columns that determine a distinct group.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 95c0a1926c..efc5e41389 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1235,6 +1235,29 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When the rows retrieved from an index scan are then deduplicated by
+ eliminating rows matching on a prefix of index keys (e.g. when using
+ <literal>SELECT DISTINCT</literal>), the planner will consider
+ skipping groups of rows with a matching key prefix. When a row with
+ a particular prefix is found, the remaining rows with the same key
+ prefix are skipped. The more rows share the same key prefix (i.e. the
+ fewer distinct key prefixes the index contains), the more efficient
+ this is.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 294ffa6e20..58919ca708 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -109,6 +109,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 38593554f0..eaaf7db78d 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,6 +61,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8d9c8d025d..7015485430 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -82,6 +82,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index a0597a0c6e..b69a54cb71 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -80,6 +80,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 9dfa0ddfbb..237efb86a2 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -730,6 +731,23 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool scanstart, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction,
+ indexdir, scanstart, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..46471598d1 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -131,6 +131,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -380,6 +381,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -447,6 +450,16 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool start, int prefix)
+{
+ return _bt_skip(scan, direction, indexdir, start, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..9e9d5c77c3 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -37,6 +37,10 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+static inline void _bt_update_skip_scankeys(IndexScanDesc scan,
+ Relation indexRel);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir);
/*
@@ -1375,6 +1379,309 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple.
+ *
+ * The current position is set so that a subsequent call to _bt_next will
+ * fetch the first tuple that differs in the leading 'prefix' keys.
+ *
+ * There are four different kinds of skipping (depending on dir and
+ * indexdir) that are important to distinguish, especially in the presence
+ * of an index condition:
+ *
+ * * Advancing forward and reading forward
+ * simple scan
+ *
+ * * Advancing forward and reading backward
+ * scan inside a cursor fetching backward, when skipping is necessary
+ * right from the start
+ *
+ * * Advancing backward and reading forward
+ * scan with order by desc inside a cursor fetching forward, when
+ * skipping is necessary right from the start
+ *
+ * * Advancing backward and reading backward
+ * simple scan with order by desc
+ *
+ * The current page is searched for the next unique value. If none is found
+ * we will do a scan from the root in order to find the next page with
+ * a unique value.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool scanstart, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Relation indexRel = scan->indexRelation;
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ /*
+ * If skipScanKey is NULL, initialize it with _bt_mkscankey; otherwise
+ * only update the sk_flags / sk_argument elements, to eliminate
+ * repeated free/realloc.
+ */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ so->skipScanKey->keysz = prefix;
+ so->skipScanKey->scantid = NULL;
+ }
+ else
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ /* Check if the next unique key can be found within the current page */
+ if (BTScanPosIsValid(so->currPos) &&
+ _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {
+ bool keyFound = false;
+
+ LockBuffer(so->currPos.buf, BT_READ);
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, so->currPos.buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(so->currPos.buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found the scan key within the current page, so scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and
+ * offset number.
+ */
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ /*
+ * Simplest case is when both directions are forward, when we are already
+ * at the next distinct key at the beginning of the series (so everything
+ * else would be done in _bt_readpage)
+ *
+ * The case when both directions are backwards is also simple, but we need
+ * to go one step back, since we need a last element from the previous
+ * series.
+ */
+ if (ScanDirectionIsBackward(dir) && ScanDirectionIsBackward(indexdir))
+ offnum = OffsetNumberPrev(offnum);
+
+ /*
+ * Advance backward but read forward. At this moment we are at the next
+ * distinct key at the beginning of the series. If the scan has just
+ * started, we can read forward without doing anything else. Otherwise
+ * find the previous distinct key and the beginning of its series, and
+ * read forward from there. To do so, go back one step, perform a binary
+ * search to find the first item in the series, and let _bt_readpage do
+ * everything else.
+ */
+ else if (ScanDirectionIsBackward(dir) && ScanDirectionIsForward(indexdir))
+ {
+ if (!scanstart)
+ {
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* One step back to find a previous value */
+ _bt_readpage(scan, dir, offnum);
+
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (_bt_next(scan, dir))
+ {
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ /*
+ * Now find the last item of the series for the current
+ * value, with the intention of doing OffsetNumberNext. As a
+ * result we end up on the first element of the series.
+ */
+ if (_bt_scankey_within_page(scan, so->skipScanKey,
+ so->currPos.buf, dir))
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ else
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ }
+ else
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ }
+
+ /*
+ * Advance forward but read backward. At this moment we are at the next
+ * distinct key at the beginning of the series. If the scan has just
+ * started, we can go one step back and start reading without doing
+ * anything else. Otherwise find the next distinct key and the beginning
+ * of its series, go one step back, and read backward from there.
+ *
+ * An interesting situation can happen if one of the distinct keys does
+ * not pass the corresponding index condition at all. In this case
+ * reading backward can lead to a previous distinct key being found,
+ * creating a loop. To avoid that, check the value to be returned, and
+ * jump one more time if it's the same as at the beginning.
+ */
+ else if (ScanDirectionIsForward(dir) && ScanDirectionIsBackward(indexdir))
+ {
+ if (scanstart)
+ offnum = OffsetNumberPrev(offnum);
+ else
+ {
+ OffsetNumber nextOffset,
+ startOffset;
+
+ nextOffset = startOffset = ItemPointerGetOffsetNumber(&scan->xs_itup->t_tid);
+
+ while (nextOffset == startOffset)
+ {
+ /*
+ * Find a next index tuple to update scan key. It could be at
+ * the end, so check for max offset
+ */
+ OffsetNumber curOffnum = offnum;
+ Page page = BufferGetPage(so->currPos.buf);
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ ItemId itemid = PageGetItemId(page, Min(offnum, maxoff));
+
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ scan->xs_itup = (IndexTuple) PageGetItem(page, itemid);
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ if (BTScanPosIsValid(so->currPos))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /*
+ * Jumping to the next key returned the same offset, which
+ * means we are at the end and need to return
+ */
+ if (offnum == curOffnum)
+ {
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+
+ BTScanPosUnpinIfPinned(so->currPos);
+ BTScanPosInvalidate(so->currPos);
+
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+
+ /* Check if _bt_readpage returns already found item */
+ if (_bt_readpage(scan, indexdir, offnum))
+ {
+ IndexTuple itup;
+
+ currItem = &so->currPos.items[so->currPos.lastItem];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
+ }
+ else
+ elog(ERROR, "could not read closest index tuples: %d", offnum);
+
+ /*
+ * If nextOffset is the same as before, it means we are in a
+ * loop; return offnum to its original position and jump
+ * further
+ */
+ if (nextOffset == startOffset)
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+ }
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, indexdir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
@@ -2246,3 +2553,59 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
so->numKilled = 0; /* just paranoia */
so->markItemIndex = -1; /* ditto */
}
+
+/*
+ * _bt_update_skip_scankeys() -- set up new values for the existing scankeys
+ * based on the current index tuple
+ */
+static inline void
+_bt_update_skip_scankeys(IndexScanDesc scan, Relation indexRel)
+{
+ TupleDesc itupdesc;
+ int indnkeyatts,
+ i;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey scankeys = so->skipScanKey->scankeys;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+}
+
+/*
+ * _bt_scankey_within_page() -- check whether the provided scankey could be
+ * found within the page in the given buffer.
+ */
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf, ScanDirection dir)
+{
+ OffsetNumber low,
+ high,
+ compare_offset;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ int compare_value = ScanDirectionIsForward(dir) ? 0 : 1;
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+
+ if (unlikely(high < low))
+ return false;
+
+ compare_offset = ScanDirectionIsForward(dir) ? high : low;
+
+ return _bt_compare(scan->indexRelation,
+ key, page, compare_offset) > compare_value;
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 45472db147..dc151ecf09 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -64,6 +64,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 62fb3434a3..b66296d6c9 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -130,6 +130,7 @@ static void ExplainDummyGroup(const char *objtype, const char *labelname,
static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
static void ExplainJSONLineEnding(ExplainState *es);
static void ExplainYAMLLineStarting(ExplainState *es);
+static void ExplainIndexSkipScanKeys(int skipPrefixSize, ExplainState *es);
static void escape_yaml(StringInfo buf, const char *str);
@@ -1041,6 +1042,22 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
return planstate_tree_walker(planstate, ExplainPreScanNode, rels_used);
}
+/*
+ * ExplainIndexSkipScanKeys -
+ * Append information about index skip scan to es->str.
+ *
+ * Can be used to print the skip prefix size.
+ */
+static void
+ExplainIndexSkipScanKeys(int skipPrefixSize, ExplainState *es)
+{
+ if (skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL, skipPrefixSize, es);
+ }
+}
+
/*
* ExplainNode -
* Appends a description of a plan tree to es->str
@@ -1363,6 +1380,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexScan *indexscan = (IndexScan *) plan;
+ ExplainIndexSkipScanKeys(indexscan->indexskipprefixsize, es);
+
ExplainIndexScanDetails(indexscan->indexid,
indexscan->indexorderdir,
es);
@@ -1373,6 +1392,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ ExplainIndexSkipScanKeys(indexonlyscan->indexskipprefixsize, es);
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1582,6 +1603,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
switch (nodeTag(plan))
{
case T_IndexScan:
+ if (((IndexScan *) plan)->indexskipprefixsize > 0)
+ ExplainPropertyBool("Skip scan", true, es);
show_scan_qual(((IndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
if (((IndexScan *) plan)->indexqualorig)
@@ -1595,6 +1618,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->indexskipprefixsize > 0)
+ ExplainPropertyBool("Skip scan", true, es);
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 784486f0c8..985fc3c50f 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -65,6 +65,13 @@ IndexOnlyNext(IndexOnlyScanState *node)
IndexScanDesc scandesc;
TupleTableSlot *slot;
ItemPointer tid;
+ IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) node->ss.ps.plan;
+
+ /*
+ * Tells whether the current position was reached via skipping; in that
+ * case there is no need to call index_getnext_tid.
+ */
+ bool skipped = false;
/*
* extract necessary information from index scan node
@@ -72,7 +79,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexonlyscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -115,14 +122,50 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ *
+ * When fetching from a cursor in the direction opposite to the overall
+ * scan direction, the result must be what normal fetching would have
+ * returned, but in reverse order. In other words, return the last or
+ * the first scanned tuple in a DISTINCT set, depending on the cursor
+ * direction. Because of that we also skip when the first tuple hasn't
+ * been emitted yet but the directions are opposite.
+ */
+ if (node->ioss_SkipPrefixSize > 0 &&
+ (node->ioss_FirstTupleEmitted ||
+ ScanDirectionsAreOpposite(direction, indexonlyscan->indexorderdir)))
+ {
+ if (!index_skip(scandesc, direction, indexonlyscan->indexorderdir,
+ !node->ioss_FirstTupleEmitted, node->ioss_SkipPrefixSize))
+ {
+ /*
+ * Reached the end of the index. At this point currPos is
+ * invalidated, and we need to reset ioss_FirstTupleEmitted:
+ * otherwise, after going backwards, reaching the end of the
+ * index, and going forward again, we would apply the skip
+ * again, incorrectly skipping an extra item.
+ */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ else
+ {
+ skipped = true;
+ tid = &scandesc->xs_heaptid;
+ }
+ }
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
- while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+ while (skipped || (tid = index_getnext_tid(scandesc, direction)) != NULL)
{
bool tuple_from_heap = false;
CHECK_FOR_INTERRUPTS();
+ skipped = false;
/*
* We can skip the heap fetch if the TID references a heap page on
@@ -250,6 +293,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -504,6 +549,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index c06d07aa46..3e82fa37c7 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -85,6 +85,13 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ IndexScan *indexscan = (IndexScan *) node->ss.ps.plan;
+
+ /*
+ * Tells whether the current position was reached via skipping; in that
+ * case there is no need to call index_getnext_tid.
+ */
+ bool skipped = false;
/*
* extract necessary information from index scan node
@@ -92,7 +99,7 @@ IndexNext(IndexScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -116,6 +123,7 @@ IndexNext(IndexScanState *node)
node->iss_NumOrderByKeys);
node->iss_ScanDesc = scandesc;
+ node->iss_ScanDesc->xs_want_itup = true;
/*
* If no run-time keys to calculate or they are ready, go ahead and
@@ -127,12 +135,48 @@ IndexNext(IndexScanState *node)
node->iss_OrderByKeys, node->iss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ *
+ * When fetching from a cursor in the direction opposite to the overall
+ * scan direction, the result must be what normal fetching would have
+ * returned, but in reverse order. In other words, return the last or
+ * the first scanned tuple in a DISTINCT set, depending on the cursor
+ * direction. Because of that we also skip when the first tuple hasn't
+ * been emitted yet but the directions are opposite.
+ */
+ if (node->iss_SkipPrefixSize > 0 &&
+ (node->iss_FirstTupleEmitted ||
+ ScanDirectionsAreOpposite(direction, indexscan->indexorderdir)))
+ {
+ if (!index_skip(scandesc, direction, indexscan->indexorderdir,
+ !node->iss_FirstTupleEmitted, node->iss_SkipPrefixSize))
+ {
+ /*
+ * Reached the end of the index. At this point currPos is
+ * invalidated, and we need to reset iss_FirstTupleEmitted:
+ * otherwise, after going backwards, reaching the end of the
+ * index, and going forward again, we would apply the skip
+ * again, incorrectly skipping an extra item.
+ */
+ node->iss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ else
+ {
+ skipped = true;
+ index_fetch_heap(scandesc, slot);
+ }
+ }
+
/*
* ok, now that we have what we need, fetch the next tuple.
*/
- while (index_getnext_slot(scandesc, direction, slot))
+ while (skipped || index_getnext_slot(scandesc, direction, slot))
{
CHECK_FOR_INTERRUPTS();
+ skipped = false;
/*
* If the index was lossy, we have to recheck the index quals using
@@ -149,6 +193,7 @@ IndexNext(IndexScanState *node)
}
}
+ node->iss_FirstTupleEmitted = true;
return slot;
}
@@ -910,6 +955,8 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexScan;
+ indexstate->iss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->iss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 2f267e4bb6..7ae5d96c07 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -490,6 +490,7 @@ _copyIndexScan(const IndexScan *from)
COPY_NODE_FIELD(indexorderbyorig);
COPY_NODE_FIELD(indexorderbyops);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
@@ -515,6 +516,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 1ccd68d3aa..aec39e7ba0 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -559,6 +559,7 @@ _outIndexScan(StringInfo str, const IndexScan *node)
WRITE_NODE_FIELD(indexorderbyorig);
WRITE_NODE_FIELD(indexorderbyops);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
@@ -573,6 +574,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 764e3bb90c..0fc3c5ea68 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1787,6 +1787,7 @@ _readIndexScan(void)
READ_NODE_FIELD(indexorderbyorig);
READ_NODE_FIELD(indexorderbyops);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
@@ -1806,6 +1807,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c5f6593485..194e258dc1 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -124,6 +124,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index aee81bd755..5b9a41ef10 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -175,12 +175,14 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
Oid indexid, List *indexqual, List *indexqualorig,
List *indexorderby, List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -2908,7 +2910,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -2919,7 +2922,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexorderbys,
indexorderbys,
indexorderbyops,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
copy_generic_path_info(&scan_plan->plan, &best_path->path);
@@ -5182,7 +5186,8 @@ make_indexscan(List *qptlist,
List *indexorderby,
List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexScan *node = makeNode(IndexScan);
Plan *plan = &node->scan.plan;
@@ -5199,6 +5204,7 @@ make_indexscan(List *qptlist,
node->indexorderbyorig = indexorderbyorig;
node->indexorderbyops = indexorderbyops;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
@@ -5211,7 +5217,8 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5226,6 +5233,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 8f03a20825..b3ebbc5a1d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4828,6 +4828,82 @@ create_distinct_paths(PlannerInfo *root,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
+
+ /* Consider index skip scan as well */
+ if (enable_indexskipscan &&
+ IsA(path, IndexPath) &&
+ ((IndexPath *) path)->indexinfo->amcanskip &&
+ root->distinct_pathkeys != NIL)
+ {
+ ListCell *lc;
+ IndexOptInfo *index = NULL;
+ bool different_columns_order = false,
+ not_empty_qual = false;
+ int i = 0;
+ int distinctPrefixKeys;
+
+ Assert(path->pathtype == T_IndexOnlyScan ||
+ path->pathtype == T_IndexScan);
+
+ index = ((IndexPath *) path)->indexinfo;
+ distinctPrefixKeys = list_length(root->query_uniquekeys);
+
+ /*
+ * Normally distinctPrefixKeys is just the number of
+ * distinct keys. But suppose the only distinct key is a,
+ * while the index stores b, a in exactly that order. In
+ * that case we must use the position of a within the
+ * index as distinctPrefixKeys; otherwise skipping would
+ * happen only on the first column.
+ */
+ foreach(lc, root->query_uniquekeys)
+ {
+ UniqueKey *uniquekey = (UniqueKey *) lfirst(lc);
+ EquivalenceMember *em =
+ lfirst_node(EquivalenceMember,
+ list_head(uniquekey->eq_clause->ec_members));
+ Var *var = (Var *) em->em_expr;
+
+ Assert(i < index->ncolumns);
+
+ for (i = 0; i < index->ncolumns; i++)
+ {
+ if (index->indexkeys[i] == var->varattno)
+ {
+ distinctPrefixKeys = Max(i + 1, distinctPrefixKeys);
+ break;
+ }
+ }
+ }
+
+ /*
+ * XXX: In the index scan case, qual evaluation happens
+ * after ExecScanFetch, which means skip results could be
+ * filtered out. Consider the following query:
+ *
+ * select distinct on (a, b) a, b, c from t where c < 100;
+ *
+ * Skip scan returns one tuple per distinct set of (a, b)
+ * with an arbitrary c, so if the chosen c does not match
+ * the qual while some other c does, we miss that tuple.
+ */
+ if (path->pathtype == T_IndexScan &&
+ parse->jointree != NULL &&
+ parse->jointree->quals != NULL &&
+ list_length((List *) parse->jointree->quals) != 0)
+ not_empty_qual = true;
+
+ if (!different_columns_order && !not_empty_qual)
+ {
+ add_path(distinct_rel, (Path *)
+ create_skipscan_unique_path(root,
+ distinct_rel,
+ path,
+ distinctPrefixKeys,
+ numDistinctRows));
+ }
+ }
}
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index ec02c468d0..6119d7311f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2915,6 +2915,46 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode that is the same as an existing IndexPath except
+ * that it skips duplicate values. This may or may not be cheaper than
+ * using create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *basepath,
+ int distinctPrefixKeys,
+ double numGroups)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ /* The size of the prefix we'll use for skipping. */
+ Assert(pathnode->indexinfo->amcanskip);
+ Assert(distinctPrefixKeys > 0);
+ /*Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));*/
+ pathnode->indexskipprefix = distinctPrefixKeys;
+
+ /*
+ * The cost to skip to each distinct value should be roughly the same as
+ * the cost of finding the first key times the number of distinct values
+ * we expect to find.
+ */
+ pathnode->path.startup_cost = basepath->startup_cost;
+ pathnode->path.total_cost = basepath->startup_cost * numGroups;
+ pathnode->path.rows = numGroups;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 5e889d1861..c443c1f2d3 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -271,6 +271,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4b3769b8b0..ecf4b64a76 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -916,6 +916,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index be02a76d9d..cab198feb9 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,7 @@
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6e3db06eed..f84791e358 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -130,6 +130,13 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir,
+ ScanDirection indexdir,
+ bool start,
+ int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -225,6 +232,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index a813b004be..d33e995a73 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -180,6 +180,8 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool start, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..cf7a24444d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -662,6 +662,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTScanInsert skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -776,6 +779,8 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool start, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -800,6 +805,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, bool start, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/include/access/sdir.h b/src/include/access/sdir.h
index 664e72ef5d..dff90fada1 100644
--- a/src/include/access/sdir.h
+++ b/src/include/access/sdir.h
@@ -55,4 +55,11 @@ typedef enum ScanDirection
#define ScanDirectionIsForward(direction) \
((bool) ((direction) == ForwardScanDirection))
+/*
+ * ScanDirectionsAreOpposite
+ * True iff scan directions are backward/forward or forward/backward.
+ */
+#define ScanDirectionsAreOpposite(dirA, dirB) \
+ ((bool) (dirA != NoMovementScanDirection && dirA == -dirB))
+
#endif /* SDIR_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 44f76082e9..9e6d501ad1 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1428,6 +1428,8 @@ typedef struct IndexScanState
ExprContext *iss_RuntimeContext;
Relation iss_RelationDesc;
struct IndexScanDescData *iss_ScanDesc;
+ int iss_SkipPrefixSize;
+ bool iss_FirstTupleEmitted;
/* These are needed for re-checking ORDER BY expr ordering */
pairingheap *iss_ReorderQueue;
@@ -1457,6 +1459,8 @@ typedef struct IndexScanState
* TableSlot slot for holding tuples fetched from the table
* VMBuffer buffer in use for visibility map testing, if any
* PscanLen size of parallel index-only scan descriptor
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ----------------
*/
typedef struct IndexOnlyScanState
@@ -1475,6 +1479,8 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
TupleTableSlot *ioss_TableSlot;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 10ece6c875..08eb432a3b 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -837,6 +837,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
void (*amcostestimate) (); /* AM's cost estimator */
@@ -1187,6 +1188,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1199,6 +1203,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 8e6594e355..f09c8c43a3 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -405,6 +405,8 @@ typedef struct IndexScan
List *indexorderbyorig; /* the same in original form */
List *indexorderbyops; /* OIDs of sort ops for ORDER BY exprs */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct
+ * scans */
} IndexScan;
/* ----------------
@@ -432,6 +434,8 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct
+ * scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b3d0b4f6fb..9abfdfb6bd 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 37a946f857..09d61a8e99 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -201,6 +201,11 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ RelOptInfo *rel,
+ Path *subpath,
+ int numCols,
+ double numGroups);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index f3696c6d1d..51e12ac925 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -244,3 +244,508 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index only skip scan
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, tenthous, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) tenthous
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+SELECT DISTINCT a FROM distinct_a;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+SELECT DISTINCT a FROM distinct_a WHERE a = 1;
+ a
+---
+ 1
+(1 row)
+
+SELECT DISTINCT a FROM distinct_a ORDER BY a DESC;
+ a
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a;
+ QUERY PLAN
+--------------------------------------------------------
+ Index Only Scan using distinct_a_a_b_idx on distinct_a
+ Skip scan: true
+(2 rows)
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+ a | b
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+-- check columns order
+CREATE INDEX distinct_a_b_a on distinct_a (b, a);
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+ a | b
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+ QUERY PLAN
+----------------------------------------------------
+ Index Only Scan using distinct_a_b_a on distinct_a
+ Skip scan: true
+ Index Cond: (b = 2)
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+ QUERY PLAN
+----------------------------------------------------
+ Index Only Scan using distinct_a_b_a on distinct_a
+ Skip scan: true
+ Index Cond: (b = 2)
+(3 rows)
+
+DROP INDEX distinct_a_b_a;
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+FETCH FROM c;
+ a | b
+---+---
+ 1 | 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+END;
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+FETCH FROM c;
+ a | b
+---+-------
+ 5 | 10000
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+END;
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+ QUERY PLAN
+--------------------------------------------------------------
+ Index Only Scan using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan: true
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 1 | 2
+ 3 | 1 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 1 | 2
+ 1 | 1 | 2
+(2 rows)
+
+END;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Index Only Scan Backward using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan: true
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 2 | 2
+ 1 | 2 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 2 | 2
+ 3 | 2 | 2
+(2 rows)
+
+END;
+DROP TABLE distinct_abc;
+-- index skip scan
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+ a | b | c
+---+---+----
+ 1 | 1 | 10
+ 2 | 1 | 10
+ 3 | 1 | 10
+ 4 | 1 | 10
+ 5 | 1 | 10
+(5 rows)
+
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+ a | b | c
+---+---+----
+ 1 | 1 | 10
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+ QUERY PLAN
+---------------------------------------------------
+ Index Scan using distinct_a_a_b_idx on distinct_a
+ Skip scan: true
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+ QUERY PLAN
+-----------------------------------------------------
+ Unique
+ -> Bitmap Heap Scan on distinct_a
+ Recheck Cond: (a = 1)
+ -> Bitmap Index Scan on distinct_a_a_b_idx
+ Index Cond: (a = 1)
+(5 rows)
+
+-- check columns order
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+ QUERY PLAN
+---------------------------------------------------------
+ Unique
+ -> Index Scan using distinct_a_a_b_idx on distinct_a
+ Index Cond: (b = 2)
+ Filter: (c = 10)
+(4 rows)
+
+-- check projection case
+SELECT DISTINCT a, a FROM distinct_a WHERE b = 2;
+ a | a
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 3
+ 4 | 4
+ 5 | 5
+(5 rows)
+
+SELECT DISTINCT a, 1 FROM distinct_a WHERE b = 2;
+ a | ?column?
+---+----------
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT a FROM distinct_a;
+FETCH FROM c;
+ a
+---
+ 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a
+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+END;
+DROP TABLE distinct_a;
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 9999
+ 1 | 10000
+(5 rows)
+
+DROP TABLE distinct_visibility;
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+ SELECT a, b::int2 b, (b % 2)::int2 c FROM
+ generate_series(1, 5) a,
+ generate_series(1,366) b;
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Index Only Scan using distinct_boundaries_a_b_c_idx on distinct_boundaries
+ Skip scan: true
+ Index Cond: ((b >= 1) AND (c = 0))
+(3 rows)
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ a | b | c
+---+---+---
+ 1 | 2 | 0
+ 2 | 2 | 0
+ 3 | 2 | 0
+ 4 | 2 | 0
+ 5 | 2 | 0
+(5 rows)
+
+DROP TABLE distinct_boundaries;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..bd3b373515 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashjoin | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_mergejoin | on
enable_nestloop | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(17 rows)
+(18 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index a605e86449..4c8a50d153 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -73,3 +73,189 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index only skip scan
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, tenthous, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) tenthous
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+
+SELECT DISTINCT a FROM distinct_a;
+SELECT DISTINCT a FROM distinct_a WHERE a = 1;
+SELECT DISTINCT a FROM distinct_a ORDER BY a DESC;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a;
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+
+-- check columns order
+CREATE INDEX distinct_a_b_a on distinct_a (b, a);
+
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+
+DROP INDEX distinct_a_b_a;
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+DROP TABLE distinct_abc;
+
+-- index skip scan
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+
+-- check columns order
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+
+-- check projection case
+SELECT DISTINCT a, a FROM distinct_a WHERE b = 2;
+SELECT DISTINCT a, 1 FROM distinct_a WHERE b = 2;
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT a FROM distinct_a;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+DROP TABLE distinct_a;
+
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DROP TABLE distinct_visibility;
+
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+ SELECT a, b::int2 b, (b % 2)::int2 c FROM
+ generate_series(1, 5) a,
+ generate_series(1,366) b;
+
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+DROP TABLE distinct_boundaries;
--
2.21.0
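To make the contract of the new amskip callback easier to review: btskip/_bt_skip is expected to move the scan past every index entry that shares the current entry's leading key prefix, so the executor sees one tuple per distinct prefix. The toy sketch below illustrates that contract over a plain sorted array; `IndexTuple2`, `same_prefix`, and `skip_to_next_prefix` are made-up names, and a linear walk stands in for the B-tree re-descent a real implementation would do:

```c
#include <assert.h>
#include <stddef.h>

/* An index entry reduced to the two key columns (a, b). */
typedef struct IndexTuple2 { int a; int b; } IndexTuple2;

/* Do two entries share the same leading `prefix` columns (1 or 2)? */
static int
same_prefix(const IndexTuple2 *x, const IndexTuple2 *y, int prefix)
{
    if (x->a != y->a)
        return 0;
    return prefix < 2 || x->b == y->b;
}

/*
 * Advance *pos past every entry sharing the current entry's prefix,
 * leaving it on the first entry of the next distinct prefix.  Returns 0
 * once the "scan" is exhausted.  A real B-tree implementation would
 * re-descend from the root rather than step linearly.
 */
static int
skip_to_next_prefix(const IndexTuple2 *index, size_t ntuples,
                    size_t *pos, int prefix)
{
    IndexTuple2 cur;

    if (*pos >= ntuples)
        return 0;
    cur = index[*pos];
    while (*pos < ntuples && same_prefix(&index[*pos], &cur, prefix))
        (*pos)++;
    return *pos < ntuples;
}
```

An ExecScanFetch-style caller would invoke this once per emitted row, which is what lets DISTINCT over a few groups touch only a handful of index pages instead of every tuple.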